[01:56:43] <jinxer-wm>	 (SystemdUnitFailed) firing: rsync-published.service Failed on stat1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:34:53] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:36:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:45:27] <jinxer-wm>	 (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[04:45:47] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:51:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:10:32] <wikibugs>	 Data-Engineering, DBA, Data-Services, cloud-services-team: Rebuild sanitarium hosts - https://phabricator.wikimedia.org/T337446 (Marostegui) >>! In T337446#8885229, @Ladsgroup wrote: > I think something is replaying transactions twice sometimes (and probably in 10.4.29). e.g. for the s1 broken re...
[06:19:28] <wikibugs>	 Data-Engineering, DBA, Data-Services, cloud-services-team: Rebuild sanitarium hosts - https://phabricator.wikimedia.org/T337446 (Marostegui) For what is worth, these hosts were running previously: 10.4.26 ` dpkg.log:2023-05-24 08:49:57 upgrade wmf-mariadb104:amd64 10.4.26+deb11u1 10.4.29+deb11u1...
[06:20:27] <wikibugs>	 Data-Engineering, DBA, Data-Services, cloud-services-team: Rebuild sanitarium hosts - https://phabricator.wikimedia.org/T337446 (Marostegui)
[07:10:27] <jinxer-wm>	 (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[08:51:43] <jinxer-wm>	 (SystemdUnitFailed) firing: rsync-published.service Failed on stat1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:58:59] <wikibugs>	 Data-Engineering, DBA, Data-Services, cloud-services-team: Rebuild sanitarium hosts - https://phabricator.wikimedia.org/T337446 (MusikAnimal)
[09:51:20] <wikibugs>	 Data-Engineering, DBA, Data-Services, cloud-services-team: Rebuild sanitarium hosts - https://phabricator.wikimedia.org/T337446 (Marostegui) For what is worth, s3 and s5 haven't stopped yet (it's been more than 3h since I started the shutdown process with innodb_fast_shutdown=0 for the downgrade....
[11:52:47] <wikibugs>	 Quarry, Patch-For-Review: Show replication lag - https://phabricator.wikimedia.org/T60841 (Framawiki) https://github.com/toolforge/quarry/pull/22/
[12:51:43] <jinxer-wm>	 (SystemdUnitFailed) firing: rsync-published.service Failed on stat1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:51:43] <jinxer-wm>	 (SystemdUnitFailed) firing: rsync-published.service Failed on stat1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:14:43] <wikibugs>	 Data-Engineering, DBA, Data-Services, cloud-services-team: Rebuild sanitarium hosts - https://phabricator.wikimedia.org/T337446 (Marostegui) Still waiting for s3 and s5 to be stopped.
[19:15:40] <wikibugs>	 Data-Engineering, API Platform (AQS 2.0 Roadmap), Epic, Platform Engineering Roadmap, User-Eevans: AQS 2.0: Media Analytics service - Unit testing - https://phabricator.wikimedia.org/T336383 (BPirkle) p:Triage→Medium a:BPirkle→codebug
[20:51:43] <jinxer-wm>	 (SystemdUnitFailed) firing: rsync-published.service Failed on stat1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:37:06] <wikibugs>	 Data-Engineering-Planning: Bug/Incident Report [TEMPLATE] - https://phabricator.wikimedia.org/T320633 (Aklapper) Open→Invalid