[00:04:53] PROBLEM - snapshot of s2 in eqiad on alert1001 is CRITICAL: snapshot for s2 at eqiad taken more than 3 days ago: Most recent backup 2021-12-01 23:47:50 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[00:50:27] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:52:35] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:44:17] PROBLEM - WDQS high update lag on wdqs1013 is CRITICAL: 7.114e+07 ge 4.32e+07 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[02:41:41] RECOVERY - WDQS high update lag on wdqs1013 is OK: (C)4.32e+07 ge (W)2.16e+07 ge 2.119e+07 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[03:32:06] (CR) Juan90264: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/743533 (owner: Juan90264)
[03:35:41] (PS6) Juan90264: Enable Autopatroller level page protection for English Wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/743533 (https://phabricator.wikimedia.org/T296580)
[03:44:24] (PS1) Wugapodes: Enwiki config: remove autopatrol from sysop [mediawiki-config] - https://gerrit.wikimedia.org/r/743646 (https://phabricator.wikimedia.org/T297058)
[03:53:39] PROBLEM - snapshot of s3 in eqiad on alert1001 is CRITICAL: snapshot for s3 at eqiad taken more than 3 days ago: Most recent backup 2021-12-02 03:38:46 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[04:00:51] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:14:13] (CR) 4nn1l2: [C: +1] Enwiki config: remove autopatrol from sysop [mediawiki-config] - https://gerrit.wikimedia.org/r/743646 (https://phabricator.wikimedia.org/T297058) (owner: Wugapodes)
[05:44:55] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 102 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[07:21:33] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211205T0800)
[08:48:33] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 103 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[09:23:53] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:27:01] PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - Certificate toolserver.org valid until 2021-12-08 03:00:13 +0000 (expires in 2 days) https://phabricator.wikimedia.org/tag/toolforge/
[09:31:23] RECOVERY - HTTPS-toolserver on www.toolserver.org is OK: SSL OK - Certificate toolserver.org valid until 2022-02-06 03:00:07 +0000 (expires in 62 days) https://phabricator.wikimedia.org/tag/toolforge/
[10:15:31] PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - Certificate toolserver.org valid until 2021-12-08 03:00:13 +0000 (expires in 2 days) https://phabricator.wikimedia.org/tag/toolforge/
[10:17:43] RECOVERY - HTTPS-toolserver on www.toolserver.org is OK: SSL OK - Certificate toolserver.org valid until 2022-02-06 03:00:07 +0000 (expires in 62 days) https://phabricator.wikimedia.org/tag/toolforge/
[10:35:01] PROBLEM - snapshot of x1 in eqiad on alert1001 is CRITICAL: snapshot for x1 at eqiad taken more than 3 days ago: Most recent backup 2021-12-02 10:10:29 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[11:01:41] PROBLEM - Ensure traffic_exporter for the tls instance binds on port 9322 and responds to HTTP requests on cp3062 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[11:02:30] (PS1) Majavah: Disable UserMerge on labswiki (wikitech) [mediawiki-config] - https://gerrit.wikimedia.org/r/743659
[11:10:15] RECOVERY - Ensure traffic_exporter for the tls instance binds on port 9322 and responds to HTTP requests on cp3062 is OK: HTTP OK: HTTP/1.0 200 OK - 23692 bytes in 0.249 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[11:59:31] (CR) Urbanecm: [C: +1] "code OK, matches what agreed on in the RfC" [mediawiki-config] - https://gerrit.wikimedia.org/r/743646 (https://phabricator.wikimedia.org/T297058) (owner: Wugapodes)
[12:00:40] (CR) Urbanecm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/743659 (owner: Majavah)
[13:22:12] (CR) Majavah: [C: +1] noc: Make colors consistent with WikimediaUI style guide [mediawiki-config] - https://gerrit.wikimedia.org/r/742443 (owner: Ladsgroup)
[13:29:37] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:36:15] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:24:33] PROBLEM - Check systemd state on doc1001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc2001.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:19:09] RECOVERY - Check systemd state on doc1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:45:33] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:47:41] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:12:25] PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - Certificate toolserver.org valid until 2021-12-08 03:00:13 +0000 (expires in 2 days) https://phabricator.wikimedia.org/tag/toolforge/
[16:14:35] RECOVERY - HTTPS-toolserver on www.toolserver.org is OK: SSL OK - Certificate toolserver.org valid until 2022-02-06 03:00:07 +0000 (expires in 62 days) https://phabricator.wikimedia.org/tag/toolforge/
[16:31:53] (CR) Yahya: [C: +1] Enable SandboxLink extension for bnwikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/743529 (https://phabricator.wikimedia.org/T296637) (owner: Juan90264)
[16:43:38] (CR) Yahya: [C: +1] Enable groups autopatrolled and patroller for bnwikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/743528 (https://phabricator.wikimedia.org/T296637) (owner: Juan90264)
[17:34:42] (PS1) Majavah: Use new class names for CentralAuth RC feed [mediawiki-config] - https://gerrit.wikimedia.org/r/743683
[17:35:12] (CR) Majavah: [C: -2] "do not merge until I735dfda80f98274aad42a8164ba6818cdc074cc5 has safely been deployed to all wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/743683 (owner: Majavah)
[17:41:41] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:48:19] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:57:21] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:59:33] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:11:17] PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - Certificate toolserver.org valid until 2021-12-08 03:00:13 +0000 (expires in 2 days) https://phabricator.wikimedia.org/tag/toolforge/
[18:13:29] RECOVERY - HTTPS-toolserver on www.toolserver.org is OK: SSL OK - Certificate toolserver.org valid until 2022-02-06 03:00:07 +0000 (expires in 62 days) https://phabricator.wikimedia.org/tag/toolforge/
[19:00:19] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:32:11] PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - Certificate toolserver.org valid until 2021-12-08 03:00:13 +0000 (expires in 2 days) https://phabricator.wikimedia.org/tag/toolforge/
[19:38:51] RECOVERY - HTTPS-toolserver on www.toolserver.org is OK: SSL OK - Certificate toolserver.org valid until 2022-02-06 03:00:07 +0000 (expires in 62 days) https://phabricator.wikimedia.org/tag/toolforge/
[20:52:59] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:59:37] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:14:28] (PS1) Urbanecm: [labs] Set GlobalBlockRemoteReasonUrl [mediawiki-config] - https://gerrit.wikimedia.org/r/743695 (https://phabricator.wikimedia.org/T243863)
[22:39:17] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook