[00:03:17] PROBLEM - Host db1131 is DOWN: PING CRITICAL - Packet loss = 100%
[00:04:39] RECOVERY - Host db1131 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms
[00:14:40] SRE, SRE-swift-storage, MediaWiki-extensions-Score, Performance-Team (Radar): Add cache key information to metadata json - https://phabricator.wikimedia.org/T257093 (Reedy) Ok, I have now deleted everything before `20210101000000` ` reedy@mwmaint1002:~$ mwscript extensions/Score/maintenance/GetL...
[00:22:41] PROBLEM - SSH on thumbor1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:25:29] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:31:25] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:40:01] RECOVERY - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:42:57] (PS12) Andrew Bogott: OpenStack: Always use the tls ports (25000, 25357) for openstack services [puppet] - https://gerrit.wikimedia.org/r/732738 (https://phabricator.wikimedia.org/T267194)
[00:46:21] (CR) Andrew Bogott: [C: +2] OpenStack: Always use the tls ports (25000, 25357) for openstack services [puppet] - https://gerrit.wikimedia.org/r/732738 (https://phabricator.wikimedia.org/T267194) (owner: Andrew Bogott)
[00:55:13] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:58:13] (PS1) Andrew Bogott: Designate: specify https for keystone authtoken url [puppet] - https://gerrit.wikimedia.org/r/733353 (https://phabricator.wikimedia.org/T267194)
[00:58:29] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:59:31] (CR) Andrew Bogott: [C: +2] Designate: specify https for keystone authtoken url [puppet] - https://gerrit.wikimedia.org/r/733353 (https://phabricator.wikimedia.org/T267194) (owner: Andrew Bogott)
[01:14:12] (PS1) Andrew Bogott: Striker: use https endpoint for OpenStack [puppet] - https://gerrit.wikimedia.org/r/733362 (https://phabricator.wikimedia.org/T267194)
[01:15:01] PROBLEM - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:15:31] (CR) Andrew Bogott: [C: +2] Striker: use https endpoint for OpenStack [puppet] - https://gerrit.wikimedia.org/r/733362 (https://phabricator.wikimedia.org/T267194) (owner: Andrew Bogott)
[01:16:49] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:18:37] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:21:00] (PS1) Andrew Bogott: keystone: use https check rather than http check for endpoint checks [puppet] - https://gerrit.wikimedia.org/r/733363 (https://phabricator.wikimedia.org/T267194)
[01:21:57] (CR) Andrew Bogott: [C: +2] keystone: use https check rather than http check for endpoint checks [puppet] - https://gerrit.wikimedia.org/r/733363 (https://phabricator.wikimedia.org/T267194) (owner: Andrew Bogott)
[01:22:41] RECOVERY - SSH on thumbor1001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:25:03] RECOVERY - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:36:30] PROBLEM - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:55:20] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:01:00] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:10:32] (PS1) A2093064: Add User and User talk to $wgExemptFromUserRobotsControl on zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/733403 (https://phabricator.wikimedia.org/T288947)
[02:40:06] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:46:02] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:15:04] (PS1) Andrew Bogott: nova: increase timeout for flavor monitoring [puppet] - https://gerrit.wikimedia.org/r/733442
[03:17:39] (CR) Andrew Bogott: [C: +2] nova: increase timeout for flavor monitoring [puppet] - https://gerrit.wikimedia.org/r/733442 (owner: Andrew Bogott)
[03:25:22] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:29:37] (PS2) Andrew Bogott: Openstack haproxy: Fix keystone internal port [puppet] - https://gerrit.wikimedia.org/r/732087
[03:31:20] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:33:09] (PS1) Andrew Bogott: Revert "nova: increase timeout for flavor monitoring" [puppet] - https://gerrit.wikimedia.org/r/733119
[03:34:19] (CR) Andrew Bogott: [C: +2] Revert "nova: increase timeout for flavor monitoring" [puppet] - https://gerrit.wikimedia.org/r/733119 (owner: Andrew Bogott)
[04:40:30] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:46:38] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:25:26] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:31:22] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:55:14] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:01:12] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:14:53] (CR) Juan90264: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/733120 (https://phabricator.wikimedia.org/T291146) (owner: Juan90264)
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211024T0700)
[07:25:20] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:31:26] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:48:02] PROBLEM - graphite.wikimedia.org api on graphite2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.110 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting
[07:50:00] RECOVERY - graphite.wikimedia.org api on graphite2003 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.085 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting
[08:01:44] PROBLEM - Check unit status of statograph_post on alert1001 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[08:10:24] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:10:38] PROBLEM - graphite.wikimedia.org api on graphite2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.114 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting
[08:12:40] RECOVERY - graphite.wikimedia.org api on graphite2003 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.075 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting
[08:12:46] RECOVERY - Check unit status of statograph_post on alert1001 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[08:16:32] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:55:16] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:01:14] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:40:06] RECOVERY - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:53:04] PROBLEM - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:55:06] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:01:14] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:10:10] RECOVERY - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[11:23:18] PROBLEM - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[11:39:53] SRE-swift-storage, MediaWiki-extensions-Score, I18n, MW-1.38-notes (1.38.0-wmf.5; 2021-10-19): Fix mime type and text encoding in Content-Type HTTP header of LilyPond .ly file output - https://phabricator.wikimedia.org/T184871 (Reedy)
[11:40:18] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:46:30] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:25:28] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:31:38] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:40:40] RECOVERY - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[12:46:38] SRE-swift-storage, MediaWiki-extensions-Score, I18n, MW-1.38-notes (1.38.0-wmf.5; 2021-10-19): Fix mime type and text encoding in Content-Type HTTP header of LilyPond .ly file output - https://phabricator.wikimedia.org/T184871 (TheDJ) Open→Resolved a:TheDJ
[12:53:46] PROBLEM - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[12:58:51] SRE, Product-Infrastructure-Team-Backlog, Traffic-Icebox, Maps (Kartotherian): Geoshapes service is not sending 'access-control-allow-origin' header to some requests - https://phabricator.wikimedia.org/T241644 (TheDJ)
[12:59:14] SRE, Product-Infrastructure-Team-Backlog, Traffic-Icebox, Maps (Kartotherian): Geoshapes service is not sending 'access-control-allow-origin' header to some requests - https://phabricator.wikimedia.org/T241644 (TheDJ) We are still seeing this problem.
[13:10:38] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:13:31] SRE, Product-Infrastructure-Team-Backlog, Traffic-Icebox, Maps (Kartotherian): Geoshapes service is not sending 'access-control-allow-origin' header to some requests - https://phabricator.wikimedia.org/T241644 (TheDJ) OK, just by refreshing I can just 'fill the cache' and every refresh i have mor...
[13:16:48] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:19:21] SRE, Product-Infrastructure-Team-Backlog, Traffic-Icebox, Maps (Kartotherian): Geoshapes service is not sending 'access-control-allow-origin' header to some requests - https://phabricator.wikimedia.org/T241644 (TheDJ) Also of note, there is no Cache-Control response header, only Age... There is a...
[13:25:02] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:31:12] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:02:46] (CR) Urbanecm: [C: +1] "okay to merge now, we're ready (scheduled for Thursday)" [mediawiki-config] - https://gerrit.wikimedia.org/r/717506 (https://phabricator.wikimedia.org/T205347) (owner: Urbanecm)
[14:10:12] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:16:22] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:55:18] RECOVERY - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[14:55:32] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:01:44] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:08:22] PROBLEM - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[15:51:48] (PS3) Andrew Bogott: Openstack haproxy: Fix keystone internal port [puppet] - https://gerrit.wikimedia.org/r/732087
[15:51:50] (PS1) Andrew Bogott: codfw1dev openstack: fix keystone http port number [puppet] - https://gerrit.wikimedia.org/r/733893
[15:55:16] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:59:33] (CR) Andrew Bogott: [C: +2] codfw1dev openstack: fix keystone http port number [puppet] - https://gerrit.wikimedia.org/r/733893 (owner: Andrew Bogott)
[16:01:26] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:03:17] (PS4) Andrew Bogott: Openstack haproxy: Revise keystone internal port [puppet] - https://gerrit.wikimedia.org/r/732087
[16:40:30] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:46:40] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:40:04] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:46:12] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:25:20] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:31:30] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:10:36] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:16:44] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:55:30] RECOVERY - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[20:08:36] PROBLEM - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[20:10:02] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:16:12] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:55:16] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:01:28] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:40:06] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:46:16] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:10:06] RECOVERY - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[22:23:12] PROBLEM - Check unit status of netbox_ganeti_codfw_test_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[22:25:28] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:31:38] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:10:36] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:16:50] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:25:02] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:31:12] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:32:46] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1178.98 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica