[01:56:43] (SystemdUnitFailed) firing: rsync-published.service Failed on stat1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:34:53] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:36:43] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:45:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[04:45:47] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:51:43] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:10:32] Data-Engineering, DBA, Data-Services, cloud-services-team: Rebuild sanitarium hosts - https://phabricator.wikimedia.org/T337446 (Marostegui) >>! In T337446#8885229, @Ladsgroup wrote: > I think something is replaying transactions twice sometimes (and probably in 10.4.29). e.g. for the s1 broken re...
[06:19:28] Data-Engineering, DBA, Data-Services, cloud-services-team: Rebuild sanitarium hosts - https://phabricator.wikimedia.org/T337446 (Marostegui) For what is worth, these hosts were running previously: 10.4.26 ` dpkg.log:2023-05-24 08:49:57 upgrade wmf-mariadb104:amd64 10.4.26+deb11u1 10.4.29+deb11u1...
[06:20:27] Data-Engineering, DBA, Data-Services, cloud-services-team: Rebuild sanitarium hosts - https://phabricator.wikimedia.org/T337446 (Marostegui)
[07:10:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[08:51:43] (SystemdUnitFailed) firing: rsync-published.service Failed on stat1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:58:59] Data-Engineering, DBA, Data-Services, cloud-services-team: Rebuild sanitarium hosts - https://phabricator.wikimedia.org/T337446 (MusikAnimal)
[09:51:20] Data-Engineering, DBA, Data-Services, cloud-services-team: Rebuild sanitarium hosts - https://phabricator.wikimedia.org/T337446 (Marostegui) For what is worth, s3 and s5 haven't stopped yet (it's been more than 3h since I started the shutdown process with innodb_fast_shutdown=0 for the downgrade....
[11:52:47] Quarry, Patch-For-Review: Show replication lag - https://phabricator.wikimedia.org/T60841 (Framawiki) https://github.com/toolforge/quarry/pull/22/
[12:51:43] (SystemdUnitFailed) firing: rsync-published.service Failed on stat1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:51:43] (SystemdUnitFailed) firing: rsync-published.service Failed on stat1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:14:43] Data-Engineering, DBA, Data-Services, cloud-services-team: Rebuild sanitarium hosts - https://phabricator.wikimedia.org/T337446 (Marostegui) Still waiting for s3 and s5 to be stopped.
[19:15:40] Data-Engineering, API Platform (AQS 2.0 Roadmap), Epic, Platform Engineering Roadmap, User-Eevans: AQS 2.0: Media Analytics service - Unit testing - https://phabricator.wikimedia.org/T336383 (BPirkle) p:Triage→Medium a:BPirkle→codebug
[20:51:43] (SystemdUnitFailed) firing: rsync-published.service Failed on stat1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:37:06] Data-Engineering-Planning: Bug/Incident Report [TEMPLATE] - https://phabricator.wikimedia.org/T320633 (Aklapper) Open→Invalid