[00:16:42] (SystemdUnitFailed) firing: hardsync-published.service Failed on an-web1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:18:50] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:31:02] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:36:42] (SystemdUnitFailed) resolved: hardsync-published.service Failed on an-web1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:12:11] !log wipe kafka-test cluster (data + zookeeper config) to start clean after the issue happened yesterday
[08:12:13] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:19:18] Data-Platform-SRE, Discovery-Search, serviceops-radar: Test version compatibility between production Kafka and newer ZooKeeper - https://phabricator.wikimedia.org/T341137 (elukey) I am testing kafka 1.1.0 with a Zookeeper on Debian Bookworm, I'll report results in a few days :) The Kafka cluster is t...
[08:32:08] (Abandoned) Urbanecm: Add Czech Wikipedia to clickstream dataset [analytics/refinery] - https://gerrit.wikimedia.org/r/930895 (https://phabricator.wikimedia.org/T339805) (owner: Urbanecm)
[09:15:03] !log run puppet on hadoop masters to pick up changes from recently decommissioned hosts
[09:15:04] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:28:55] !log running sre.hadoop.roll-restart-masters restart the masters to completely remove any reference of analytics[1058-1069] T317861
[09:28:58] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:28:59] T317861: Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861
[09:52:16] o/ btullis Getting this from the sre.hadoop.roll-restart-masters. an-master1002 is the current active node https://www.irccloud.com/pastebin/nRPvuAqY/
[09:53:44] stevemunene: I would tend to shy away from this kind of work on a Friday, unless it's urgent. Looking now.
[09:55:09] So the namenode service on an-master1001 has failed and an-master1002 is the currently active namenode.
[09:55:14] https://www.irccloud.com/pastebin/6ljl2I6a/
[09:55:45] https://www.irccloud.com/pastebin/YnFAxXe9/
[09:55:55] I'll try starting it.
[09:56:21] !log `sudo systemctl start hadoop-hdfs-namenode.service` on an-master1001
[09:56:23] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:56:37] https://www.irccloud.com/pastebin/5l4nmeVG/
[09:56:54] Appears to have started successfully 👍
[09:58:55] I thought we found out that we didn't have to restart the namenode processes any more, after the nodes were excluded.
[09:59:05] Sorry about that, I'll avoid any restart jobs on Fridays
[09:59:54] It's ok. I've done it too, it's just a potentially disruptive operation and if we don't need to do it, I'd avoid it.
[10:01:39] Namenode on an-master1001 is still not healthy.
[10:01:43] https://www.irccloud.com/pastebin/1cEsmMWY/
[10:02:04] Maybe it'll be ok in a few minutes
[10:02:39] Yeah, looks better now.
[10:02:43] https://www.irccloud.com/pastebin/78J9X4KA/
[10:03:56] We should probably fail it back to the master at some point today, but I'd certainly give it a bit of time to settle first. What state is your cookbook in?
[10:04:01] RECOVERY - HDFS topology check on an-master1001 is OK: OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_topology_check
[10:08:36] The cookbook was in a failed state END (FAIL) - Cookbook sre.hadoop.roll-restart-masters (exit_code=99) restart masters for Hadoop analytics cluster: Restart of jvm daemons.
[10:10:58] OK, that's fine. So it ran to completion, but exited with a fail. We don't yet know why the namenode on an-master1001 failed, but it's running again now.
[10:13:10] So here's where it received the command to shut down. `2023-07-07 09:33:18,922 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: RECEIVED SIGNAL 15: SIGTERM`
[10:13:25] That's in `/var/log/hadoop-hdfs/hadoop-hdfs-namenode-an-master1001.log`
[10:15:12] Seems to be around the time the cookbook was running
[10:16:39] Yes that would make sense. Scrolling through the logs from 09:38 it looks like a successful startup so far. But the service had failed by 09:56 when I ran the command to start it again. Still reading...
[10:21:38] Here we are. Failure to write to enough of the journal nodes.
[10:21:44] https://www.irccloud.com/pastebin/9xqL3XG3/
[10:22:10] btullis: I found also this in the syslog
[10:22:11] Jul 7 09:33:23 an-master1001 hadoop-hdfs-namenode[20216]: namenode did not stop gracefully after 5 seconds: killing with kill -9
[10:22:14] Jul 7 09:33:23 an-master1001 hadoop-hdfs-namenode[20216]: Stopped Hadoop namenode:.
[10:22:17] that explains the sigterm
[10:22:18] :(
[10:27:01] Maybe we should let it have a bit more than 5 seconds to shut down gracefully. It definitely recorded the signal 15 in its log and was shutting down.
[10:28:50] I'm a bit worried about the failure to write to a quorum of the journal nodes though, 12 minutes after starting up. That was the fatal error that caused it to shut down again.
[10:31:50] ah I didn't read correctly, the log mentioned a 15
[10:32:06] sorry not a sigkill, my bad
[10:32:07] mmmm
[10:32:16] weird
[10:33:42] but it was standby at that point right?
[10:33:45] Nono, I think you were right. The cookbook tried to restart it and sent a 15 at 09:33:18 - then 5 seconds later systemd got brutal and sent it a 9 at 09:33:23
[10:34:27] ack ack
[10:34:29] Yes I think it was probably standby at that point, but it depends how far through Steve had got with the cookbook.
[10:34:50] from the logs I see that it was standby, and the cookbook then tried to restart it
[10:35:10] from the logs --> before the kill I mean
[10:35:25] This was at the 'manual switch over back to an-master1001' step
[10:35:34] Good 👍
[10:37:02] here I see
[10:37:03] 2023-07-07 09:44:23,103 INFO org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Triggering log roll on remote NameNode
[10:37:17] so the standby is trying to sync with the active, in theory
[10:40:19] this is very weird, but I noticed several of these:
[10:40:19] 2023-07-07 09:45:10,332 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 9323ms
[10:40:49] those logs IIRC indicated a complete stall of running code to allow the GC to run
[10:40:54] almost 10 seconds
[10:41:12] and after a few:
[10:41:12] 2023-07-07 09:45:19,894 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 9061ms
[10:41:22] another 9 seconds stall
[10:41:38] again and again until it failed
[10:42:16] the only suggestion that I can think of is that the GC played a role
[10:42:20] Then shortly after this was when it started having trouble flushing writes to the journal nodes as well.
[10:42:35] exactly yes
[10:43:13] Yes, I agree. I can't see any problem on the journal nodes themselves. Nothing particularly busy about them https://grafana.wikimedia.org/d/000000585/hadoop?orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus%2Fanalytics&var-hadoop_cluster=analytics-hadoop&var-worker=All
[11:30:32] Data-Platform-SRE, Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (BTullis) The hadoop-test workers are all upgraded to bullseye. ` btullis@cumin1001:~$ sudo cumin 'P{F:lsbdistcodename = buster} and A:hadoop-worker-test' No hosts found that matches t...
[11:30:49] Data-Platform-SRE, Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (BTullis)
[11:47:38] (PS1) Btullis: Update the MAE and MCE entrypoints [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/936260 (https://phabricator.wikimedia.org/T329514)
[12:00:59] (CR) CI reject: [V: -1] Update the MAE and MCE entrypoints [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/936260 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[12:26:06] Data-Engineering, DBA, Data-Services: Prepare and check storage layer for gpewiki - https://phabricator.wikimedia.org/T338678 (Ladsgroup) It's done from our side, handing over to @BTullis
[12:31:49] Data-Platform-SRE, Patch-For-Review: Upgrade Datahub to v0.10.0 - https://phabricator.wikimedia.org/T329514 (BTullis) At long last I have got into the datahub frontend on staging, running version 0.10.4. I had to use the default credentials of `datahub:datahub` so it looks like the JAAS configuration for...
[12:31:58] Data-Platform-SRE, Patch-For-Review: Upgrade Datahub to v0.10.4 - https://phabricator.wikimedia.org/T329514 (BTullis)
[12:37:51] (PS2) Btullis: Update the MAE and MCE entrypoints [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/936260 (https://phabricator.wikimedia.org/T329514)
[12:59:38] (CR) Btullis: [C: +2] Update the MAE and MCE entrypoints [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/936260 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[13:12:05] (Merged) jenkins-bot: Update the MAE and MCE entrypoints [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/936260 (https://phabricator.wikimedia.org/T329514) (owner: Btullis)
[14:03:26] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Configure new WDQS servers in codfw (wdqs20[13-22]) - https://phabricator.wikimedia.org/T332314 (bking) Current state: 2019 and 2020 are production-ready. The others need a data transfer and/or scap deploy to be...
[14:05:37] Starting build #25 for job wikimedia-event-utilities-maven-release-docker
[14:10:06] Project wikimedia-event-utilities-maven-release-docker build #25: SUCCESS in 4 min 30 sec: https://integration.wikimedia.org/ci/job/wikimedia-event-utilities-maven-release-docker/25/
[14:23:02] Data-Engineering, Data Engineering and Event Platform Team (Sprint 0): ProduceCanaryEvents job should be scheduled by Airflow - https://phabricator.wikimedia.org/T341229 (Ottomata) Oo, it would be really nice if we could modify the job logic a little bit, to be able to produce events with the time approp...
[14:36:06] Data-Engineering, Data Engineering and Event Platform Team, Epic, Event-Platform: [Event Platform] Design and Implement realtime enrichment pipeline for MW page change with content - https://phabricator.wikimedia.org/T307959 (Ottomata)
[15:06:37] (PS1) Ottomata: Use eventutilities-spark JsonSchemaSparkConverter [analytics/refinery/source] - https://gerrit.wikimedia.org/r/936293 (https://phabricator.wikimedia.org/T321854)
[15:09:47] (CR) Ottomata: [C: -1] "Not yet tested with Refine, just code!" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/936293 (https://phabricator.wikimedia.org/T321854) (owner: Ottomata)
[15:12:44] (CR) CI reject: [V: -1] Use eventutilities-spark JsonSchemaSparkConverter [analytics/refinery/source] - https://gerrit.wikimedia.org/r/936293 (https://phabricator.wikimedia.org/T321854) (owner: Ottomata)
[15:16:22] (PS2) Ottomata: Use eventutilities-spark JsonSchemaSparkConverter [analytics/refinery/source] - https://gerrit.wikimedia.org/r/936293 (https://phabricator.wikimedia.org/T321854)
[17:58:07] Data-Engineering, Data Engineering and Event Platform Team, Event-Platform: Make meta.dt required on all schemas that declare it - https://phabricator.wikimedia.org/T340044 (xcollazo) After discussions with @Ottomata, we speculated that the fact that `meta.dt` is not marked as `required` [[ https://g...
[17:59:10] Data-Engineering, Data Engineering and Event Platform Team, Event-Platform: Make meta.dt required on all schemas that declare it - https://phabricator.wikimedia.org/T340044 (xcollazo)
[18:15:07] Data-Platform-SRE, Patch-For-Review: Upgrade Datahub to v0.10.4 - https://phabricator.wikimedia.org/T329514 (BTullis) I've fixed the LDAP issue now, so I can log into the staging version of datahub again. Next I have to check that the MAE and MCE consumers are operating correctly. {F37132406,width=60%}
[19:14:41] Analytics, Data-Engineering-Icebox: Create a tool checking for data presence based on file-size - https://phabricator.wikimedia.org/T256644 (Gopavasanth) Hi @JAllemandou could you also share why this tool is needed, any use case might help understand better? Also "Create a tool checking for data presence...
[19:48:28] Data-Engineering, Data-Engineering-Kanban, Data Engineering and Event Platform Team, Data Pipelines, Epic: [Airflow] User manual and documentation - https://phabricator.wikimedia.org/T295199 (odimitrijevic)
[20:43:37] Data-Engineering, Data-Platform-SRE, DBA, Data-Services: Prepare and check storage layer for gpewiki - https://phabricator.wikimedia.org/T338678 (BTullis)
[20:44:38] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host an-worker1156.eqiad.wmnet with OS bullseye
[20:47:14] (PS1) Mazevedo: Add missing source property to ios search schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/936337
[20:48:09] (PS2) Mazevedo: Add missing source property to ios search schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/936337 (https://phabricator.wikimedia.org/T335544)
[20:49:18] (CR) Kimberly Sarabia: "This change is ready for review." (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/934032 (https://phabricator.wikimedia.org/T337270) (owner: Kimberly Sarabia)
[21:08:42] (CR) Tsevener: [C: +2] Add missing source property to ios search schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/936337 (https://phabricator.wikimedia.org/T335544) (owner: Mazevedo)
[21:09:14] (Merged) jenkins-bot: Add missing source property to ios search schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/936337 (https://phabricator.wikimedia.org/T335544) (owner: Mazevedo)
[21:17:42] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:19:04] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:22:06] (CR) Bearloga: [V: +2 C: +2] "End of an era. Thank you" [analytics/wmf-product/jobs] - https://gerrit.wikimedia.org/r/932458 (https://phabricator.wikimedia.org/T333218) (owner: Neil Shah-Quinn (WMF))
[21:30:04] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:32:42] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:36:33] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Configure new WDQS servers in codfw (wdqs20[13-22]) - https://phabricator.wikimedia.org/T332314 (bking) Update: wdqs[2017-2021].codfw.wmnet are now production ready: ` ===== NODE GROUP ===== (4) wdqs[2014-2016,...
[22:04:58] Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host an-worker1156.eqiad.wmnet with OS bullseye executed with errors: - an-wo...
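
Aside on the 10:40 to 10:42 exchange above, where repeated JvmPauseMonitor lines were spotted by eye in the namenode log: a small script can pull those pauses out so they can be compared against the time of the journal node write failure. This is only a rough sketch, not something anyone in the channel ran. It assumes the log line format quoted in the chat; the function name `find_pauses`, the 5000 ms threshold, and the `SAMPLE` list (the lines quoted at 10:37, 10:40 and 10:41, with the two pause lines rejoined) are made up for illustration.

```python
import re
from datetime import datetime

# Matches JvmPauseMonitor lines in the format quoted in the chat.
PAUSE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} INFO "
    r"org\.apache\.hadoop\.util\.JvmPauseMonitor: "
    r"Detected pause in JVM or host machine \(eg GC\): "
    r"pause of approximately (?P<ms>\d+)ms"
)


def find_pauses(lines, threshold_ms=5000):
    """Return (timestamp, pause_ms) for every pause of at least threshold_ms.

    threshold_ms is an arbitrary illustrative default, not a Hadoop setting.
    """
    pauses = []
    for line in lines:
        match = PAUSE_RE.match(line)
        if match and int(match.group("ms")) >= threshold_ms:
            ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S")
            pauses.append((ts, int(match.group("ms"))))
    return pauses


# Log lines quoted in the chat; the first one is not a pause and is skipped.
SAMPLE = [
    "2023-07-07 09:44:23,103 INFO org.apache.hadoop.hdfs.server.namenode.ha."
    "EditLogTailer: Triggering log roll on remote NameNode",
    "2023-07-07 09:45:10,332 INFO org.apache.hadoop.util.JvmPauseMonitor: "
    "Detected pause in JVM or host machine (eg GC): "
    "pause of approximately 9323ms",
    "2023-07-07 09:45:19,894 INFO org.apache.hadoop.util.JvmPauseMonitor: "
    "Detected pause in JVM or host machine (eg GC): "
    "pause of approximately 9061ms",
]

if __name__ == "__main__":
    found = find_pauses(SAMPLE)
    for ts, ms in found:
        print(f"{ts:%H:%M:%S} paused {ms} ms")
    print(f"total: {sum(ms for _, ms in found)} ms")
```

To use it on the real file, the lines of `/var/log/hadoop-hdfs/hadoop-hdfs-namenode-an-master1001.log` could be passed to `find_pauses` in place of `SAMPLE`; whether the pauses actually caused the journal write failure is still only the hypothesis raised in the chat.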