[00:00:42] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:07:59] 10Data-Engineering, 10Data-Platform-SRE, 10Movement-Insights, 10Product-Analytics: Reconstruct Hive & Hadoop permissions for shared database - https://phabricator.wikimedia.org/T288983 (10Mayakp.wiki) [00:20:04] 10Data-Engineering, 10Product-Analytics, 10Wmfdata-Python: Wmfdata-Python triggers a Pandas warning during mariadb.run and hive.run - https://phabricator.wikimedia.org/T324135 (10nshahquinn-wmf) [00:42:10] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team, 10Data Pipelines: Enable libmamba by default for conda environment solving - https://phabricator.wikimedia.org/T337258 (10nshahquinn-wmf) @BTullis I just released Wmfdata 2.0.1 with a [few small fixes](https://github.com/w... 
[00:56:39] 10Data-Engineering, 10Product-Analytics, 10Wmfdata-Python: Remove Wmfdata's custom update-notification code - https://phabricator.wikimedia.org/T346706 (10nshahquinn-wmf) [01:01:19] 10Data-Engineering, 10Product-Analytics, 10Wmfdata-Python: Remove Wmfdata code related to the conda-analytics migration - https://phabricator.wikimedia.org/T346707 (10nshahquinn-wmf) [01:15:19] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:20:42] (SystemdUnitFailed) resolved: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:38:42] (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:40:03] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:18:42] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:35:54] 10Data-Engineering, 10Data-Services, 10cloud-services-team: Surface Temporary user information to Cloud Wiki Replicas - https://phabricator.wikimedia.org/T346679 (10Marostegui) [07:33:36] I just realized that `helmfile` was written by an ex colleague of mine. 
This is why the name rang a bell [07:57:02] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1146.eqiad.wmnet with OS bullseye [07:59:24] !log redeploying eventgate-analytics in staging T336041 [07:59:27] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [07:59:27] T336041: Bring kafka-jumbo10[09-15] into service - https://phabricator.wikimedia.org/T336041 [08:02:23] !log redeploying eventgate-analytics-external in staging T336041 [08:02:26] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:05:07] !log redeploying eventstream-internal in staging T336041 [08:05:16] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:05:16] T336041: Bring kafka-jumbo10[09-15] into service - https://phabricator.wikimedia.org/T336041 [08:10:47] stevemunene: I have successfully deployed https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/958496 in staging. Is there an extra step I have to take to deploy to production (only deploy during a given time window, make an announcement of some sort, etc)? [08:12:33] brouberol: I can answer this one :-) No, nothing usually. It's usually straight on to codfw and then eqiad, as long as no problems have been observed in staging. [08:15:34] alrighty then! (I asked Steve because you've been my go to person for any and all questions for the past 2 weeks and you might be getting tired of it) [08:17:01] Not a problem. 
:-) [08:17:29] !log redeploying eventstream-analytics in eqiad T336041 [08:17:33] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:17:33] T336041: Bring kafka-jumbo10[09-15] into service - https://phabricator.wikimedia.org/T336041 [08:18:57] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:20:04] 10Data-Platform-SRE: Bring stat1010 into service with GPU from stat1005 - https://phabricator.wikimedia.org/T336040 (10Gehel) Confirmed that the GPU is currently still in stat1005 [08:26:55] team: - I think I'd like to proceed with this kafka-jumbo max-message size settings change today, if that's ok. https://gerrit.wikimedia.org/r/c/operations/puppet/+/952160/ - I have a +1 but I'd be keen for a double-check. [08:27:33] +1! [08:28:17] 10Data-Platform-SRE, 10Patch-For-Review: Bring kafka-jumbo10[09-15] into service - https://phabricator.wikimedia.org/T336041 (10brouberol) [08:36:20] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1146.eqiad.wmnet with OS bullseye completed: - an-worker1146 (**WARN**) - Downtimed on Icinga/Alertm... 
[08:53:42] (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:58:21] 10Data-Engineering, 10Data-Platform-SRE, 10Product-Analytics: Conda analytics environments breakage - conflicting dependencies between r-base and other - https://phabricator.wikimedia.org/T343823 (10Gehel) We need a higher level strategy on how we are exposing our compute / data platform. @lbowmaker should b... [09:00:03] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 2), 10Event-Platform: [SPIKE] Can we identify indicators to inform an SLO for event emission and intake? - https://phabricator.wikimedia.org/T345195 (10gmodena) [09:03:42] (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:08:42] (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:13:54] 10Data-Platform-SRE: Bring druid10[09-11] into service - https://phabricator.wikimedia.org/T336042 (10Stevemunene) a:03Stevemunene [09:18:42] (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:23:42] (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:26:48] 10Data-Platform-SRE, 10Dumps-Generation, 10cloud-services-team: clouddumps100[12] puppet alert: "Puppet performing a change on every puppet run" - https://phabricator.wikimedia.org/T346165 (10dcaro) [09:27:56] !log deploying change to kafka-jumbo settings for T344688 [09:27:59] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:27:59] T344688: Increase Max Message Size in Kafka Jumbo - https://phabricator.wikimedia.org/T344688 [09:27:59] btullis: here's the tiny Gobblin config change I was talking about, if you have 2 min [09:28:05] https://gerrit.wikimedia.org/r/c/analytics/refinery/+/958511/ [09:31:48] (03CR) 10Btullis: [C: 03+1] "Looks good to me." [analytics/refinery] - 10https://gerrit.wikimedia.org/r/958511 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol) [09:32:17] (03CR) 10Brouberol: [C: 03+1] [Gobblin] Add kafka-jumbo1010 to config [analytics/refinery] - 10https://gerrit.wikimedia.org/r/958511 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol) [09:33:04] brouberol: Looks good +1d it. I would make a reference to it in https://etherpad.wikimedia.org/p/analytics-weekly-train and then it can go out the next time someone deploys refinery, which will likely be this afternoon. [09:33:17] already done [09:33:42] (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:34:28] however, it seems a +1 + another self-applied +1 isn't enough to merge [09:35:18] do these CRs get merged by the data-eng team when the train leaves? [09:35:19] Do you not have +2 rights yet? 
I can add you if not. [09:35:29] not on that repo, it seems [09:36:31] Hang on a sec... I can add you. There was some discussion about this kind of thing on Slack yesterday. https://wikimedia.slack.com/archives/CSV483812/p1695041181809139 [09:37:31] Can you try again now please? [09:38:25] > do these CRs get merged by the data-eng team when the train leaves? [09:38:25] I would say, sometimes. Might depend on the comment in the etherpad and/or whether any corresponding changes are required to other repos. [09:38:42] (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:38:53] (03CR) 10Brouberol: [C: 03+2] [Gobblin] Add kafka-jumbo1010 to config [analytics/refinery] - 10https://gerrit.wikimedia.org/r/958511 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol) [09:39:01] it worked, thanks! 
[09:39:05] (03CR) 10Brouberol: [V: 03+2 C: 03+2] [Gobblin] Add kafka-jumbo1010 to config [analytics/refinery] - 10https://gerrit.wikimedia.org/r/958511 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol) [09:40:25] !log commencing rolling restart of all brokers in kafka-jumbo [09:40:26] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:42:57] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1147.eqiad.wmnet with OS bullseye [09:48:42] (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:53:42] (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:58:30] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1148.eqiad.wmnet with OS bullseye [10:03:42] (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:13:42] (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:18:42] (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:25:05] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1147.eqiad.wmnet with OS bullseye completed: - an-worker1147 (**PASS**) - Downtimed on Icinga/Alertm... [10:38:33] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1148.eqiad.wmnet with OS bullseye completed: - an-worker1148 (**PASS**) - Downtimed on Icinga/Alertm... [10:38:42] (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:41:01] (03PS7) 10Btullis: Update to Superset version 2.2.1 [analytics/superset/deploy] - 10https://gerrit.wikimedia.org/r/957938 (https://phabricator.wikimedia.org/T335356) [10:44:46] 10Data-Platform-SRE, 10Dumps-Generation, 10cloud-services-team: clouddumps100[12] puppet alert: "Puppet performing a change on every puppet run" - https://phabricator.wikimedia.org/T346165 (10Milimetric) Ok, to resolve I'm going to erase this dvd.html file from all the dumpsdata hosts as [[ https://wikitech.... 
[10:48:42] (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:56:43] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10Stevemunene) We have successfully completed the hadoop worker upgrades to Bullseye. ` sudo cumin --no-progress a:hadoop-worker 'cat /etc/debian_version' 86 hosts will be targeted: an-worker[1078-1095,1097-1156... [10:57:50] 10Data-Platform-SRE: Upgrade hadoop workers to bullseye - https://phabricator.wikimedia.org/T332570 (10BTullis) >>! In T332570#9177923, @Stevemunene wrote: > We have successfully completed the hadoop worker upgrades to Bullseye. Excellent! [11:08:42] (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:18:42] (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:24:45] (03PS1) 10Jennifer Ebe: Update changelog for v0.2.22 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/958909 [11:25:39] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for deploy" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/958909 (owner: 10Jennifer Ebe) [11:26:51] Starting build #127 for job analytics-refinery-maven-release-docker [11:38:17] (03PS8) 10Btullis: Update to Superset version 3.0.0 [analytics/superset/deploy] - 10https://gerrit.wikimedia.org/r/957938 
(https://phabricator.wikimedia.org/T335356) [11:39:36] Project analytics-refinery-maven-release-docker build #127: 09SUCCESS in 12 min: https://integration.wikimedia.org/ci/job/analytics-refinery-maven-release-docker/127/ [11:40:50] Starting build #86 for job analytics-refinery-update-jars-docker [11:41:12] (03PS1) 10Maven-release-user: Add refinery-source jars for v0.2.22 to artifacts [analytics/refinery] - 10https://gerrit.wikimedia.org/r/957825 [11:41:13] Project analytics-refinery-update-jars-docker build #86: 09SUCCESS in 22 sec: https://integration.wikimedia.org/ci/job/analytics-refinery-update-jars-docker/86/ [11:44:25] (03CR) 10Jennifer Ebe: [V: 03+2 C: 03+2] "Merging for deployment" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/957825 (owner: 10Maven-release-user) [12:02:23] !log deploying refinery from deployment.eqiad.wmnet [12:02:25] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:10:58] 10Data-Engineering, 10Data-Platform-SRE: Define priorities for HDFS data to be backed up - https://phabricator.wikimedia.org/T283261 (10JAllemandou) a:05JAllemandou→03None [12:26:52] btullis: is the kafka-jumbo rr still ongoing? [12:27:08] btullis - hello [12:27:59] btullis: we're having an issue with our deployment - I think it's related to this patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/950194 [12:28:06] btullis: maybe [12:28:18] brouberol: Yes 'fraid so. 8 out of 10 have been restarted. [12:28:28] the problem: we (either jennifer_ebe or myself) can't sudo as hdfs on an-launcher1002 [12:28:38] btullis: --^ [12:28:52] no worries! I'm catching up on other tasks in the meantime [12:29:13] joal: OK, shall we jump on a meeting? [12:29:21] sure - batcave [12:42:51] (HdfsRpcQueueLength) firing: RPC call queue length on the analytics-hadoop cluster is too high. 
- https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Namenode_RPC_length_queue - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsRpcQueueLength [12:44:06] haven't seen --^ in a while :D [12:44:12] and general question: I see that we're waiting for >=900s when we restart kafka services until we allow ourselves to move on to the next one. I was wondering if it would make sense to provide a command `kafka-check-broker-insync --broker-id xxxx` that would check whether all assigned partitions are back into the ISR. This way, if the broker catches [12:44:12] up in less than 900s, we'd be able to have faster rolling restarts, and if the broker catches up in *more* than 900s, we'd avoid putting the cluster at risk by rebooting another broker. [12:44:42] brouberol: yes definitely we never really got to that point [12:45:07] * brouberol makes a note to create a ticket [12:45:52] note: this is how we were doing w/ kafka in kube: the readiness probe of the pod was checking for the in-sync status of all partitions assigned to the broker + some additional wait time for good measure [12:46:09] in $PREV_JOB I mean [12:47:51] (HdfsRpcQueueLength) resolved: RPC call queue length on the analytics-hadoop cluster is too high. 
- https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_Namenode_RPC_length_queue - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsRpcQueueLength [12:49:16] (03PS1) 10Btullis: Update refinery-deploy-to-hdfs to use sudo [analytics/refinery] - 10https://gerrit.wikimedia.org/r/958927 (https://phabricator.wikimedia.org/T334493) [12:50:22] (03PS2) 10Btullis: Update refinery-deploy-to-hdfs to use sudo [analytics/refinery] - 10https://gerrit.wikimedia.org/r/958927 (https://phabricator.wikimedia.org/T334493) [12:51:05] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging to fix deploy :)" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/958927 (https://phabricator.wikimedia.org/T334493) (owner: 10Btullis) [12:56:53] 10Data-Platform-SRE: Use the replication status of assigned partitions as a gate between kafka broker rolling-restarts - https://phabricator.wikimedia.org/T346741 (10brouberol) [13:05:43] brouberol: The cookbook has finished. kafka-jumbo restarted. [13:06:25] :+1 thanks! if that's alright with you, I'll proceed with the provisioning of kafka-jumbo1011->1015 after a rebase [13:07:26] !log redeploying refinery from deployment.eqiad.wmnet using scap [13:07:27] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:11:01] in that spirit, can I ask for an approval of https://gerrit.wikimedia.org/r/c/operations/puppet/+/957919/10 and all subsequent PRs? Thanks! [13:21:55] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team (Sprint 2), 10Event-Platform: Increase Max Message Size in Kafka Jumbo - https://phabricator.wikimedia.org/T344688 (10BTullis) This is deployed and all of the brokers in the kafka-jumbo cluster have been restarted. I'll lea... 
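(editor's note) The idea from the 12:44 discussion and T346741, gating the next broker restart on the restarted broker being back in the ISR of every partition assigned to it rather than on a fixed 900s sleep, could be sketched roughly as below. This is only an illustration under stated assumptions: the function names, the polling interval, and the shape of the partition metadata (dicts with `topic`, `partition`, `replicas`, `isr`) are hypothetical and not taken from the cookbook. A real check would have to obtain that metadata from Kafka itself, for instance from the Replicas/Isr columns printed by `kafka-topics --describe`.

```python
import time
from typing import Any, Callable, List, Mapping, Sequence, Tuple

# Hypothetical metadata shape (an assumption, not the cookbook's format):
# {"topic": str, "partition": int, "replicas": [broker ids], "isr": [broker ids]}
Partition = Mapping[str, Any]


def out_of_sync_partitions(
    broker_id: int, partitions: Sequence[Partition]
) -> List[Tuple[str, int]]:
    """Return (topic, partition) pairs assigned to broker_id where it is not in the ISR."""
    return [
        (p["topic"], p["partition"])
        for p in partitions
        if broker_id in p["replicas"] and broker_id not in p["isr"]
    ]


def wait_until_in_sync(
    broker_id: int,
    fetch_partitions: Callable[[], Sequence[Partition]],
    timeout_s: float = 900.0,
    poll_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until broker_id is in the ISR of all its partitions.

    Returns True as soon as the broker has caught up, and False if it is
    still out of sync once timeout_s has elapsed. A caller doing a rolling
    restart should NOT move on to the next broker when this returns False.
    """
    deadline = clock() + timeout_s
    while True:
        if not out_of_sync_partitions(broker_id, fetch_partitions()):
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

With a gate like this, a broker that catches up quickly lets the restart proceed early, while one that is still lagging at the deadline stops the rollout instead of silently continuing. `fetch_partitions` is injected so the metadata source can be swapped or faked in tests.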
[13:23:07] 10Data-Engineering, 10Data-Platform-SRE, 10Event-Platform: Use the replication status of assigned partitions as a gate between kafka broker rolling-restarts - https://phabricator.wikimedia.org/T346741 (10brouberol) [13:32:46] !log deployment successful [13:32:47] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:42:50] 10Data-Platform-SRE, 10decommission-hardware, 10Patch-For-Review: decommission an-test-client1001.eqiad.wmnet - https://phabricator.wikimedia.org/T343520 (10Stevemunene) [13:43:37] !log deploying airflow analytics dag [13:43:39] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:46:10] Hi mforns - Would you mind checking https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/488 when you have a minute? I have changed the second job as well :) [13:48:11] (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:48:42] (SystemdUnitFailed) firing: (4) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:52:50] (SystemdUnitFailed) firing: (5) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:53:38] 10Data-Platform-SRE, 10decommission-hardware, 10Patch-For-Review: decommission an-test-client1001.eqiad.wmnet - https://phabricator.wikimedia.org/T343520 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by 
stevemunene@cumin1001 for hosts: `an-test-client1001.eqiad.wmnet` - an-test-client1001.... [13:53:46] 10Data-Platform-SRE, 10Patch-For-Review: Bring kafka-jumbo10[09-15] into service - https://phabricator.wikimedia.org/T336041 (10brouberol) [13:54:31] 10Data-Platform-SRE, 10Patch-For-Review: Bring kafka-jumbo10[09-15] into service - https://phabricator.wikimedia.org/T336041 (10brouberol) [13:57:44] !log pushing out https://gerrit.wikimedia.org/r/c/operations/puppet/+/955893 for new refinery job jar files [13:57:45] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:57:45] (SystemdUnitFailed) firing: (6) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:59:44] PROBLEM - Check systemd state on kafka-jumbo1015 is CRITICAL: CRITICAL - degraded: The following units failed: kafka.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:00:19] 10Data-Engineering, 10Structured-Data-Backlog: NEW BUG REPORT fiwiki’s section-level image suggestions aren’t generated in production - https://phabricator.wikimedia.org/T343844 (10mfossati) [14:00:47] (03PS1) 10Brouberol: Add kafka-jumbo10[11-15].eqiad.wmnet to the gobblin broker list [analytics/refinery] - 10https://gerrit.wikimedia.org/r/958939 (https://phabricator.wikimedia.org/T336041) [14:02:51] 10Data-Platform-SRE, 10decommission-hardware, 10Patch-For-Review: decommission an-test-client1001.eqiad.wmnet - https://phabricator.wikimedia.org/T343520 (10Stevemunene) [14:03:36] joal will look! 
[14:07:24] PROBLEM - Kafka Broker Server #page on kafka-jumbo1015 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration [14:08:01] 10Data-Engineering, 10Structured-Data-Backlog: NEW BUG REPORT fiwiki’s section-level image suggestions aren’t generated in production - https://phabricator.wikimedia.org/T343844 (10mfossati) CC @xcollazo @JAllemandou @BTullis for your consideration. [14:10:34] 10Data-Platform-SRE: Package kafka-kit binaries (topicmappr, metricsfetcher, ...) as a debian-package - https://phabricator.wikimedia.org/T346763 (10brouberol) [14:11:20] PROBLEM - Kafka broker TLS certificate validity on kafka-jumbo1015 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [14:12:37] 10Data-Platform-SRE: Install kafka-kit binaries on kafka brokers - https://phabricator.wikimedia.org/T346764 (10brouberol) [14:12:48] RECOVERY - Check systemd state on kafka-jumbo1015 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:13:02] 10Data-Platform-SRE, 10Patch-For-Review: Reassign partitions away from kafka-jumbo100[1-6] to kafka-jumbo10[09-15] brokers - https://phabricator.wikimedia.org/T346425 (10brouberol) [14:13:04] 10Data-Platform-SRE: Install kafka-kit binaries on kafka brokers - https://phabricator.wikimedia.org/T346764 (10brouberol) [14:13:07] 10Data-Platform-SRE: Package kafka-kit binaries (topicmappr, metricsfetcher, ...) 
as a debian-package - https://phabricator.wikimedia.org/T346763 (10brouberol) [14:13:14] RECOVERY - Kafka Broker Server #page on kafka-jumbo1015 is OK: PROCS OK: 1 process with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration [14:13:46] RECOVERY - Kafka broker TLS certificate validity on kafka-jumbo1015 is OK: SSL OK - Certificate kafka-jumbo1015.eqiad.wmnet valid until 2024-09-18 13:48:00 +0000 (expires in 364 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [14:14:51] 10Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (10brouberol) [14:14:55] 10Data-Platform-SRE, 10Patch-For-Review: Reassign partitions away from kafka-jumbo100[1-6] to kafka-jumbo10[09-15] brokers - https://phabricator.wikimedia.org/T346425 (10brouberol) [14:15:03] 10Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (10brouberol) [14:15:07] 10Data-Platform-SRE, 10Patch-For-Review: Reassign partitions away from kafka-jumbo100[1-6] to kafka-jumbo10[09-15] brokers - https://phabricator.wikimedia.org/T346425 (10brouberol) [14:16:38] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad dropped message count in last 30m on alert1001 is CRITICAL: 4.442e+04 gt 1000 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_jumbo-eqiad [14:17:36] (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:19:25] !log airflow analytics deployment with scap successful [14:19:27] Logged the message at 
https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:21:08] (03PS9) 10Btullis: Update to Superset version 2.1.1 [analytics/superset/deploy] - 10https://gerrit.wikimedia.org/r/957938 (https://phabricator.wikimedia.org/T335356) [14:29:52] (03PS1) 10Milimetric: Send etag header on all AQS responses [analytics/aqs] - 10https://gerrit.wikimedia.org/r/958945 (https://phabricator.wikimedia.org/T342213) [14:30:47] 10Data-Platform-SRE, 10CAS-SSO, 10Infrastructure-Foundations: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (10Stevemunene) We successfully implemented OIDC on production datahub and auth/login seems to be working great. However there are some challenges with the user jour... [14:32:54] (03CR) 10CI reject: [V: 04-1] Send etag header on all AQS responses [analytics/aqs] - 10https://gerrit.wikimedia.org/r/958945 (https://phabricator.wikimedia.org/T342213) (owner: 10Milimetric) [14:33:04] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad dropped message count in last 30m on alert1001 is OK: (C)1000 gt (W)100 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_jumbo-eqiad [14:35:43] 10Data-Engineering, 10Dumps 2.0, 10Data Products (Sprint 00): Develop Dumps Triage Runbook - https://phabricator.wikimedia.org/T343325 (10VirginiaPoundstone) 05Open→03Resolved [14:37:56] 10Data-Platform-SRE: analytics/refinery deployment broken at refinery-deploy-to-hdfs - https://phabricator.wikimedia.org/T334493 (10BTullis) 05Open→03Resolved [14:38:47] 10Data-Engineering, 10Data Pipelines, 10Data Engineering and Event Platform Team (Sprint 2), 10Event-Platform, 10Patch-For-Review: EventGate occasionally fails to ingest specific schemas - https://phabricator.wikimedia.org/T326002 (10phuedx) >>! In T326002#9176238, @gmodena wrote: > == Next steps > - []... 
[14:50:20] 10Data-Platform-SRE, 10Data-Catalog: DataHub rights assignment is case-sensitive - https://phabricator.wikimedia.org/T309382 (10Stevemunene) 05Open→03Resolved This was resolved by the switch to OIDC, marking it as resolved. [15:00:27] (MediawikiPageContentChangeEnrichJobManagerNotRunning) firing: ... [15:00:27] mw_page_content_change_enrich in eqiad is not running - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=eqiad%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichJobManagerNotRunning [15:07:36] (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:16:16] (03CR) 10TChin: [C: 03+2] Skip schema-deterministic-types for metrics_event schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/954097 (https://phabricator.wikimedia.org/T344511) (owner: 10TChin) [15:16:50] (03Merged) 10jenkins-bot: Skip schema-deterministic-types for metrics_event schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/954097 (https://phabricator.wikimedia.org/T344511) (owner: 10TChin) [15:17:45] (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:24:17] 10Data-Platform-SRE: Install kafka-kit binaries on kafka brokers - https://phabricator.wikimedia.org/T346764 (10BTullis) This might be interesting in terms of the Debian packaging: https://gitlab.wikimedia.org/repos/sre/wmf-debci 
It is the state of the art in terms of our CI based packaging, but I haven't tried... [15:48:43] 10Data-Platform-SRE: Decommission wdqs100[3-5] - https://phabricator.wikimedia.org/T344198 (10Gehel) [15:49:07] 10Data-Platform-SRE: Decommission wdqs100[3-5] - https://phabricator.wikimedia.org/T344198 (10Gehel) 05Open→03Resolved [15:49:20] 10Data-Platform-SRE, 10Discovery-Search (Current work): Add Zookeeper config to 'rdf-streaming-updater' test service on DSE cluster - https://phabricator.wikimedia.org/T344614 (10bking) [15:49:25] 10Data-Platform-SRE, 10Discovery-Search (Current work): Restore dse-k8s' rdf-streaming-updater from savepoint/improve bootstrapping process - https://phabricator.wikimedia.org/T345957 (10bking) 05Open→03Resolved Thanks for all your help as well. I believe this is done, but please let us know if we need to... [16:14:42] (03Abandoned) 10Mforns: Remove null entry from custom_data.[].value enum in monoschema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/958528 (https://phabricator.wikimedia.org/T344235) (owner: 10Mforns) [16:17:20] (03PS1) 10Mforns: Remove null entry from custom_data.[].value enum in monoschema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/958992 (https://phabricator.wikimedia.org/T344235) [16:23:18] 10Data-Engineering, 10Infrastructure-Foundations, 10SRE: krb1001: krb5kdc.log excessive size - https://phabricator.wikimedia.org/T337906 (10BTullis) Disk usage hit 100% and I did this again: ` btullis@krb1001:~$ sudo truncate -s 10000 /var/log/kerberos/krb5kdc.log ` This was the size beforehand. ` btullis@... 
[16:41:01] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10serviceops-radar, and 2 others: Migrate the wdqs streaming updater flink jobs to flink-k8s-operator deployment model - https://phabricator.wikimedia.org/T326409 (10bking) [16:41:18] 10Data-Platform-SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Provision Zookeeper Cluster for storing Flink HA data - https://phabricator.wikimedia.org/T341792 (10bking) 05Open→03Resolved The patch above ensures that Data Platform SREs will be alerted if there's a problem with the flink-zk... [16:43:04] 10Data-Engineering, 10Structured-Data-Backlog: NEW BUG REPORT fiwiki’s section-level image suggestions aren’t generated in production - https://phabricator.wikimedia.org/T343844 (10xcollazo) confirmed that the artifacts in question are the same: The one deployed on Airflow: ` xcollazo@stat1007:/mnt/hdfs/wmf/c... [17:10:51] 10Data-Platform-SRE, 10observability, 10Epic: [Epic] Review alerting strategy for Data Platform SRE - https://phabricator.wikimedia.org/T346438 (10bking) Per today's Data Platform SRE meeting, I've committed to lead this effort. To that end, I've [[ https://docs.google.com/spreadsheets/d/1hwVX_2va8pHgJJca4r8... 
[17:12:36] (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:13:58] PROBLEM - Check systemd state on an-worker1085 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:15:24] RECOVERY - Check systemd state on an-worker1085 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:16:30] 10Data-Platform-SRE, 10observability, 10Epic: Review alerting around Search update pipeline - https://phabricator.wikimedia.org/T346807 (10bking) [17:17:36] (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:23:43] (03PS10) 10Btullis: Update to Superset version 2.1.1 [analytics/superset/deploy] - 10https://gerrit.wikimedia.org/r/957938 (https://phabricator.wikimedia.org/T335356) [17:55:24] (03CR) 10Btullis: [C: 03+1] Add kafka-jumbo10[11-15].eqiad.wmnet to the gobblin broker list [analytics/refinery] - 10https://gerrit.wikimedia.org/r/958939 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol) [18:42:56] (03CR) 10Bearloga: [C: 03+2] Minor change to stream name [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/956937 (owner: 10Sharvaniharan) [18:43:32] (03Merged) 10jenkins-bot: Minor change to stream name [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/956937 (owner: 10Sharvaniharan) [19:00:42] (MediawikiPageContentChangeEnrichJobManagerNotRunning) firing: ... 
[19:00:42] mw_page_content_change_enrich in eqiad is not running - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=eqiad%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichJobManagerNotRunning [19:15:20] 10Data-Engineering, 10Data Pipelines, 10Data Engineering and Event Platform Team (Sprint 2), 10Event-Platform, and 2 others: EventGate occasionally fails to ingest specific schemas - https://phabricator.wikimedia.org/T326002 (10gmodena) [19:17:22] 10Data-Engineering, 10serviceops, 10Discovery-Search (Current work), 10Event-Platform: Improve the flink-app chart to provide more useful defaults - https://phabricator.wikimedia.org/T346315 (10RKemper) [19:33:29] (03CR) 10Brouberol: [C: 03+2] Add kafka-jumbo10[11-15].eqiad.wmnet to the gobblin broker list [analytics/refinery] - 10https://gerrit.wikimedia.org/r/958939 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol) [19:33:31] (03CR) 10Brouberol: [V: 03+2 C: 03+2] Add kafka-jumbo10[11-15].eqiad.wmnet to the gobblin broker list [analytics/refinery] - 10https://gerrit.wikimedia.org/r/958939 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol) [20:12:32] 10Data-Platform-SRE, 10Data Engineering and Event Platform Team (Sprint 2), 10Patch-For-Review: Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (10JAllemandou) [21:17:36] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:19:14] 10Data-Platform-SRE: Write new partman recipe for cloudelastic (jbod) and update relevant Elastic config - https://phabricator.wikimedia.org/T342463 
(10bking) a:03bking This is blocking T342538 , so I'm starting on it now. [21:21:30] 10Data-Platform-SRE: Write new partman recipe for cloudelastic (jbod) and update relevant Elastic config - https://phabricator.wikimedia.org/T342463 (10bking) [21:27:36] (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:32:36] (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:39:07] 10Data-Engineering, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Movement-Insights, and 2 others: Unique Devices seasonal trends on small projects - https://phabricator.wikimedia.org/T344381 (10Mayakp.wiki) @CorinnaHillebrand_WMDE , can you please confirm if this is completed? and when wa... [22:10:03] 10Data-Engineering, 10Data-Platform-SRE, 10Discovery-Search (Current work), 10Event-Platform, 10Patch-For-Review: Test common operations in the flink operator/k8s/Flink ZK environment - https://phabricator.wikimedia.org/T342149 (10bking) Once the patch above is merged, I think we'll need to do a little o... [23:00:42] (MediawikiPageContentChangeEnrichJobManagerNotRunning) firing: ... 
[23:00:42] mw_page_content_change_enrich in eqiad is not running - TODO - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=eqiad%20prometheus/k8s&var-namespace=mw-page-content-change-enrich&var-helm_release=main&var-operator_name=All&var-flink_job_name=mw_page_content_change_enrich - https://alerts.wikimedia.org/?q=alertname%3DMediawikiPageContentChangeEnrichJobManagerNotRunning [23:25:33] 10Data-Platform-SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Provision Zookeeper Cluster for storing Flink HA data - https://phabricator.wikimedia.org/T341792 (10colewhite) >>! In T341792#9179651, @gerritbot wrote: > Change 958991 **merged** by Btullis: > %%%[operations/puppet@production] Add...