[00:16:42] <jinxer-wm>	 (SystemdUnitFailed) firing: hardsync-published.service Failed on an-web1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:18:50] <icinga-wm>	 PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:31:02] <icinga-wm>	 RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:36:42] <jinxer-wm>	 (SystemdUnitFailed) resolved: hardsync-published.service Failed on an-web1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:12:11] <elukey>	 !log wipe kafka-test cluster (data + zookeeper config) to start clean after yesterday's issue
[08:12:13] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:19:18] <wikibugs>	 10Data-Platform-SRE, 10Discovery-Search, 10serviceops-radar: Test version compatibility between production Kafka and newer ZooKeeper - https://phabricator.wikimedia.org/T341137 (10elukey) I am testing kafka 1.1.0 with a Zookeeper on Debian Bookworm, I'll report results in a few days :) The Kafka cluster is t...
[08:32:08] <wikibugs>	 (03Abandoned) 10Urbanecm: Add Czech Wikipedia to clickstream dataset [analytics/refinery] - 10https://gerrit.wikimedia.org/r/930895 (https://phabricator.wikimedia.org/T339805) (owner: 10Urbanecm)
[09:15:03] <stevemunene>	 !log run puppet on hadoop masters to pick up changes from recently decommissioned hosts
[09:15:04] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:28:55] <stevemunene>	 !log running sre.hadoop.roll-restart-masters to restart the masters and completely remove any reference to analytics[1058-1069] T317861
[09:28:58] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:28:59] <stashbot>	 T317861: Decommission analytics10[58-69] - https://phabricator.wikimedia.org/T317861
[09:52:16] <stevemunene>	 o/ btullis Getting this from the sre.hadoop.roll-restart-masters. an-master1002 is the current active node  https://www.irccloud.com/pastebin/nRPvuAqY/
[09:53:44] <btullis>	 stevemunene: I would tend to shy away from this kind of work on a Friday, unless it's urgent. Looking now.
[09:55:09] <btullis>	 So the namenode service on an-master1001 has failed and an-master1002 is the currently active namenode.
[09:55:14] <btullis>	 https://www.irccloud.com/pastebin/6ljl2I6a/
[09:55:45] <btullis>	 https://www.irccloud.com/pastebin/YnFAxXe9/
[09:55:55] <btullis>	 I'll try starting it.
[09:56:21] <btullis>	 !log `sudo systemctl start hadoop-hdfs-namenode.service ` on an-master1001
[09:56:23] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:56:37] <btullis>	 https://www.irccloud.com/pastebin/5l4nmeVG/
[09:56:54] <btullis>	 Appears to have started successfully 👍
[09:58:55] <btullis>	 I thought we found out that we didn't have to restart the namenode processes any more, after the nodes were excluded.
[09:59:05] <stevemunene>	 Sorry about that, I'll avoid any restart jobs on Fridays
[09:59:54] <btullis>	 It's ok. I've done it too, it's just a potentially disruptive operation and if we don't need to do it, I'd avoid it.
[10:01:39] <btullis>	 Namenode on an-master1001 is still not healthy.
[10:01:43] <btullis>	 https://www.irccloud.com/pastebin/1cEsmMWY/
[10:02:04] <btullis>	 Maybe it'll be ok in a few minutes
[10:02:39] <btullis>	 Yeah, looks better now.
[10:02:43] <btullis>	 https://www.irccloud.com/pastebin/78J9X4KA/
[10:03:56] <btullis>	 We should probably fail it back to the master at some point today, but I'd certainly give it a bit of time to settle first. What state is your cookbook in?
[10:04:01] <icinga-wm>	 RECOVERY - HDFS topology check on an-master1001 is OK: OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_topology_check
[10:08:36] <stevemunene>	 The cookbook was in a failed state END (FAIL) - Cookbook sre.hadoop.roll-restart-masters (exit_code=99) restart masters for Hadoop analytics cluster: Restart of jvm daemons.
[10:10:58] <btullis>	 OK, that's fine. So it ran to completion, but exited with a fail. We don't yet know why the namenode on an-master1001 failed, but it's running again now.
[10:13:10] <btullis>	 So here's where it received the command to shut down. `2023-07-07 09:33:18,922 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: RECEIVED SIGNAL 15: SIGTERM`
[10:13:25] <btullis>	 That's in `/var/log/hadoop-hdfs/hadoop-hdfs-namenode-an-master1001.log`
[10:15:12] <stevemunene>	 Seems to be around the time the cookbook was running
[10:16:39] <btullis>	 Yes that would make sense. Scrolling through the logs from 09:38 it looks like a successful startup so far. But the service had failed by 09:56 when I ran the command to start it again. Still reading...
[10:21:38] <btullis>	 Here we are. Failure to write to enough of the journal nodes.
[10:21:44] <btullis>	 https://www.irccloud.com/pastebin/9xqL3XG3/
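For context on "enough of the journal nodes": HDFS HA with the Quorum Journal Manager needs each edit-log write acknowledged by a majority of the JournalNodes. A minimal sketch of that arithmetic follows. The journal node count of 5 is a hypothetical value for illustration only; the actual count for this cluster is not stated in the channel, and the function names are made up for this example.

```python
def quorum_size(journal_nodes: int) -> int:
    """Smallest number of acknowledgements that forms a strict majority."""
    if journal_nodes < 1:
        raise ValueError("need at least one journal node")
    return journal_nodes // 2 + 1


def write_reaches_quorum(acks: int, journal_nodes: int) -> bool:
    """True if enough journal nodes acknowledged the write in time."""
    return acks >= quorum_size(journal_nodes)


# Hypothetical cluster size, NOT taken from the log above.
ASSUMED_JOURNAL_NODES = 5

if __name__ == "__main__":
    needed = quorum_size(ASSUMED_JOURNAL_NODES)
    print(f"{needed} of {ASSUMED_JOURNAL_NODES} acks needed")
    for acks in range(ASSUMED_JOURNAL_NODES + 1):
        print(acks, write_reaches_quorum(acks, ASSUMED_JOURNAL_NODES))
```

Note that a write can miss quorum even when every journal node is healthy, if the namenode itself is too slow to send or collect the acknowledgements within its timeout.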
[10:22:10] <elukey>	 btullis: I also found this in the syslog
[10:22:11] <elukey>	 Jul  7 09:33:23 an-master1001 hadoop-hdfs-namenode[20216]: namenode did not stop gracefully after 5 seconds: killing with kill -9
[10:22:14] <elukey>	 Jul  7 09:33:23 an-master1001 hadoop-hdfs-namenode[20216]: Stopped Hadoop namenode:.
[10:22:17] <elukey>	 that explains the sigterm
[10:22:18] <elukey>	 :(
[10:27:01] <btullis>	 Maybe we should let it have a bit more than 5 seconds to shut down gracefully. It definitely recorded the signal 15 in its log and was shutting down.
[10:28:50] <btullis>	 I'm a bit worried about the failure to write to a quorum of the journal nodes though, 12 minutes after starting up. That was the fatal error that caused it to shut down again.
[10:31:50] <elukey>	 ah I didn't read correctly, the log mentioned a 15
[10:32:06] <elukey>	 sorry not a sigkill, my bad
[10:32:07] <elukey>	 mmmm
[10:32:16] <elukey>	 weird
[10:33:42] <elukey>	 but it was standby at that point right? 
[10:33:45] <btullis>	 Nono, I think you were right. The cookbook tried to restart it and sent a 15 at 09:33:18 - then 5 seconds later systemd got brutal and sent it a 9 at 09:33:23
[10:34:27] <elukey>	 ack ack
[10:34:29] <btullis>	 Yes I think it was probably standby at that point, but it depends how far through Steve had got with the cookbook.
[10:34:50] <elukey>	 from the logs I see that it was standby, and the cookbook then tried to restart it
[10:35:10] <elukey>	 from the logs --> before the kill I mean
[10:35:25] <stevemunene>	 This was at the 'manual switch over back to an-master1001' step
[10:35:34] <btullis>	 Good 👍
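The SIGTERM-then-SIGKILL timeline discussed above can be checked from the two timestamps quoted in the channel. This is a rough sketch, assuming both hosts' clocks agree and that the syslog entry (which has only one-second precision) was written at the moment of the kill; the function name is made up for this example.

```python
from datetime import datetime

# Timestamps quoted in the channel.
SIGTERM_AT = "2023-07-07 09:33:18,922"  # namenode log: RECEIVED SIGNAL 15
KILL_LOGGED_AT = "Jul  7 09:33:23"      # syslog: "killing with kill -9"


def seconds_between_term_and_kill(term: str, kill: str, year: int = 2023) -> float:
    """Return the gap in seconds between the SIGTERM and the kill -9 entry.

    Syslog omits the year, so it is passed in separately (assumed 2023 here).
    """
    term_dt = datetime.strptime(term, "%Y-%m-%d %H:%M:%S,%f")
    kill_dt = datetime.strptime(f"{year} {kill}", "%Y %b %d %H:%M:%S")
    return (kill_dt - term_dt).total_seconds()


if __name__ == "__main__":
    gap = seconds_between_term_and_kill(SIGTERM_AT, KILL_LOGGED_AT)
    print(f"gap between signal 15 and kill -9: about {gap:.3f}s")
```

Because the syslog timestamp is truncated to the second, the real gap is somewhere between roughly 4 and 5 seconds, which is consistent with the "did not stop gracefully after 5 seconds" message.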
[10:37:02] <elukey>	 here I see
[10:37:03] <elukey>	 2023-07-07 09:44:23,103 INFO org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Triggering log roll on remote NameNode
[10:37:17] <elukey>	 so the standby is trying to sync with the active, in theory
[10:40:19] <elukey>	 this is very weird, but I noticed several of these:
[10:40:19] <elukey>	 2023-07-07 09:45:10,332 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 9323ms
[10:40:49] <elukey>	 those logs IIRC indicated a complete stall of running code to allow the GC to run
[10:40:54] <elukey>	 almost 10 seconds
[10:41:12] <elukey>	 and after a few:
[10:41:12] <elukey>	 2023-07-07 09:45:19,894 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 9061ms
[10:41:22] <elukey>	 another 9 seconds stall
[10:41:38] <elukey>	 again and again until it failed
[10:42:16] <elukey>	 the only suggestion that I can think of is that the GC played a role
[10:42:20] <btullis>	 Then shortly after this was when it started having trouble flushing writes to the journal nodes as well.
[10:42:35] <elukey>	 exactly yes
[10:43:13] <btullis>	 Yes, I agree. I can't see any problem on the journal nodes themselves. Nothing particularly busy about them https://grafana.wikimedia.org/d/000000585/hadoop?orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus%2Fanalytics&var-hadoop_cluster=analytics-hadoop&var-worker=All
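To see how much time the namenode spent stalled, the JvmPauseMonitor entries can be pulled out of the namenode log and summed. The sketch below is a minimal example run against the two lines quoted in the channel plus one unrelated line; the regex is an assumption based on those two samples only, and the function name is made up for this example.

```python
import re
from datetime import datetime

# Pattern inferred from the two JvmPauseMonitor lines quoted above.
PAUSE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) \w+ "
    r"org\.apache\.hadoop\.util\.JvmPauseMonitor: .*"
    r"pause of approximately (?P<ms>\d+)ms"
)


def parse_pauses(lines):
    """Return a list of (timestamp, pause_ms) for each pause entry found."""
    pauses = []
    for line in lines:
        match = PAUSE_RE.search(line)
        if match:
            ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
            pauses.append((ts, int(match.group("ms"))))
    return pauses


# Log lines as quoted in the channel; the first one should not match.
SAMPLE = [
    "2023-07-07 09:44:23,103 INFO org.apache.hadoop.hdfs.server.namenode.ha."
    "EditLogTailer: Triggering log roll on remote NameNode",
    "2023-07-07 09:45:10,332 INFO org.apache.hadoop.util.JvmPauseMonitor: "
    "Detected pause in JVM or host machine (eg GC): pause of approximately 9323ms",
    "2023-07-07 09:45:19,894 INFO org.apache.hadoop.util.JvmPauseMonitor: "
    "Detected pause in JVM or host machine (eg GC): pause of approximately 9061ms",
]

if __name__ == "__main__":
    found = parse_pauses(SAMPLE)
    total_ms = sum(ms for _, ms in found)
    print(f"{len(found)} pauses, {total_ms} ms stalled in total")
```

In practice the input would be the lines of `/var/log/hadoop-hdfs/hadoop-hdfs-namenode-an-master1001.log` rather than the inline sample. The two quoted entries are about 9.6 seconds apart and the second reports a pause of about 9.1 seconds, so the process was doing almost no useful work in between.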
[11:30:32] <wikibugs>	 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (10BTullis) The hadoop-test workers are all upgraded to bullseye. ` btullis@cumin1001:~$ sudo cumin 'P{F:lsbdistcodename = buster} and A:hadoop-worker-test' No hosts found that matches t...
[11:30:49] <wikibugs>	 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Hadoop test cluster to Bullseye - https://phabricator.wikimedia.org/T329363 (10BTullis)
[11:47:38] <wikibugs>	 (03PS1) 10Btullis: Update the MAE and MCE entrypoints [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/936260 (https://phabricator.wikimedia.org/T329514)
[12:00:59] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Update the MAE and MCE entrypoints [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/936260 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis)
[12:26:06] <wikibugs>	 10Data-Engineering, 10DBA, 10Data-Services: Prepare and check storage layer for gpewiki - https://phabricator.wikimedia.org/T338678 (10Ladsgroup) It's done from our side, handing over to @BTullis
[12:31:49] <wikibugs>	 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Datahub to v0.10.0 - https://phabricator.wikimedia.org/T329514 (10BTullis) At long last I have got into the datahub frontend on staging, running version 0.10.4. I had to use the default credentials of `datahub:datahub` so it looks like the JAAS configuration for...
[12:31:58] <wikibugs>	 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Datahub to v0.10.4 - https://phabricator.wikimedia.org/T329514 (10BTullis)
[12:37:51] <wikibugs>	 (03PS2) 10Btullis: Update the MAE and MCE entrypoints [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/936260 (https://phabricator.wikimedia.org/T329514)
[12:59:38] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Update the MAE and MCE entrypoints [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/936260 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis)
[13:12:05] <wikibugs>	 (03Merged) 10jenkins-bot: Update the MAE and MCE entrypoints [analytics/datahub] (wmf) - 10https://gerrit.wikimedia.org/r/936260 (https://phabricator.wikimedia.org/T329514) (owner: 10Btullis)
[14:03:26] <wikibugs>	 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): Configure new WDQS servers in codfw (wdqs20[13-22]) - https://phabricator.wikimedia.org/T332314 (10bking) Current state: 2019 and 2020 are production-ready. The others need a data transfer and/or scap deploy to be...
[14:05:37] <wmf-insecte>	 Starting build #25 for job wikimedia-event-utilities-maven-release-docker
[14:10:06] <wmf-insecte>	 Project wikimedia-event-utilities-maven-release-docker build #25: 09SUCCESS in 4 min 30 sec: https://integration.wikimedia.org/ci/job/wikimedia-event-utilities-maven-release-docker/25/
[14:23:02] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 0): ProduceCanaryEvents job should be scheduled by Airflow - https://phabricator.wikimedia.org/T341229 (10Ottomata) Oo, it would be really nice if we could modify the job logic a little bit, to be able to produce events with the time approp...
[14:36:06] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Epic, 10Event-Platform: [Event Platform] Design and Implement realtime enrichment pipeline for MW page change with content - https://phabricator.wikimedia.org/T307959 (10Ottomata)
[15:06:37] <wikibugs>	 (03PS1) 10Ottomata: Use eventutilities-spark JsonSchemaSparkConverter [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/936293 (https://phabricator.wikimedia.org/T321854)
[15:09:47] <wikibugs>	 (03CR) 10Ottomata: [C: 04-1] "Not yet tested with Refine, just code!" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/936293 (https://phabricator.wikimedia.org/T321854) (owner: 10Ottomata)
[15:12:44] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Use eventutilities-spark JsonSchemaSparkConverter [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/936293 (https://phabricator.wikimedia.org/T321854) (owner: 10Ottomata)
[15:16:22] <wikibugs>	 (03PS2) 10Ottomata: Use eventutilities-spark JsonSchemaSparkConverter [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/936293 (https://phabricator.wikimedia.org/T321854)
[17:58:07] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: Make meta.dt required on all schemas that declare it - https://phabricator.wikimedia.org/T340044 (10xcollazo) After discussions with @Ottomata, we speculated that the fact that `meta.dt` is not marked as `required` [[ https://g...
[17:59:10] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: Make meta.dt required on all schemas that declare it - https://phabricator.wikimedia.org/T340044 (10xcollazo)
[18:15:07] <wikibugs>	 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Datahub to v0.10.4 - https://phabricator.wikimedia.org/T329514 (10BTullis) I've fixed the LDAP issue now, so I can log into the staging version of datahub again. Next I have to check that the MAE and MCE consumers are operating correctly. {F37132406,width=60%}
[19:14:41] <wikibugs>	 10Analytics, 10Data-Engineering-Icebox: Create a tool checking for data presence based on file-size - https://phabricator.wikimedia.org/T256644 (10Gopavasanth) Hi @JAllemandou could you also share why this tool is needed, any use case might help understand better? Also "Create a tool checking for data presence...
[19:48:28] <wikibugs>	 10Data-Engineering, 10Data-Engineering-Kanban, 10Data Engineering and Event Platform Team, 10Data Pipelines, 10Epic: [Airflow] User manual and documentation - https://phabricator.wikimedia.org/T295199 (10odimitrijevic)
[20:43:37] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10DBA, 10Data-Services: Prepare and check storage layer for gpewiki - https://phabricator.wikimedia.org/T338678 (10BTullis)
[20:44:38] <wikibugs>	 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host an-worker1156.eqiad.wmnet with OS bullseye
[20:47:14] <wikibugs>	 (03PS1) 10Mazevedo: Add missing source property to ios search schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/936337
[20:48:09] <wikibugs>	 (03PS2) 10Mazevedo: Add missing source property to ios search schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/936337 (https://phabricator.wikimedia.org/T335544)
[20:49:18] <wikibugs>	 (03CR) 10Kimberly Sarabia: "This change is ready for review." (031 comment) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/934032 (https://phabricator.wikimedia.org/T337270) (owner: 10Kimberly Sarabia)
[21:08:42] <wikibugs>	 (03CR) 10Tsevener: [C: 03+2] Add missing source property to ios search schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/936337 (https://phabricator.wikimedia.org/T335544) (owner: 10Mazevedo)
[21:09:14] <wikibugs>	 (03Merged) 10jenkins-bot: Add missing source property to ios search schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/936337 (https://phabricator.wikimedia.org/T335544) (owner: 10Mazevedo)
[21:17:42] <jinxer-wm>	 (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:19:04] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:22:06] <wikibugs>	 (03CR) 10Bearloga: [V: 03+2 C: 03+2] "End of an era. Thank you" [analytics/wmf-product/jobs] - 10https://gerrit.wikimedia.org/r/932458 (https://phabricator.wikimedia.org/T333218) (owner: 10Neil Shah-Quinn (WMF))
[21:30:04] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:32:42] <jinxer-wm>	 (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:36:33] <wikibugs>	 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): Configure new WDQS servers in codfw (wdqs20[13-22]) - https://phabricator.wikimedia.org/T332314 (10bking) Update: wdqs[2017-2021].codfw.wmnet are now production ready:  ` ===== NODE GROUP ===== (4) wdqs[2014-2016,...
[22:04:58] <wikibugs>	 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host an-worker1156.eqiad.wmnet with OS bullseye executed with errors: - an-wo...