[01:02:47] PROBLEM - snapshot of s5 in codfw on alert1001 is CRITICAL: snapshot for s5 at codfw taken more than 3 days ago: Most recent backup 2022-03-03 00:25:48 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [01:12:43] PROBLEM - SSH on dns5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:40:30] (JobUnavailable) firing: (4) Reduced availability for job gitlab in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [02:33:19] PROBLEM - Query Service HTTP Port on wdqs1006 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [04:02:31] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:10:41] PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: debian-weekly-rebuild.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:37:53] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 104 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:40:56] (JobUnavailable) firing: (3) Reduced availability for job gitlab in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [06:21:11] RECOVERY - SSH on dns5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220306T0800) [08:34:45] PROBLEM - Query Service HTTP Port on wdqs1007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [09:40:56] (JobUnavailable) firing: (3) Reduced availability for job gitlab in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [10:46:59] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 105 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [11:16:42] (PS1) Majavah: openstack: fix remaining http keystone urls [puppet] - https://gerrit.wikimedia.org/r/768293 (https://phabricator.wikimedia.org/T267194) [11:28:41] PROBLEM - SSH on dns5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:28:57] (PS1) Majavah: P:wmcs::prometheus: update blackbox urls [puppet] - https://gerrit.wikimedia.org/r/768294 [11:30:07] (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34090/console" [puppet] - https://gerrit.wikimedia.org/r/768294 (owner: Majavah) [12:09:21] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:11:23] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:40:56] (JobUnavailable) 
firing: (3) Reduced availability for job gitlab in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [13:50:19] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1013.eqiad.wmnet, wdqs1004.eqiad.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs1004.eqiad.wmnet, wdqs1013.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1004.eqiad.wmnet, wdqs1013.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:50:37] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_wikidata-updateQueryServiceLag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:50:57] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1004.eqiad.wmnet, wdqs1013.eqiad.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs1004.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1004.eqiad.wmnet, wdqs1013.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:53:07] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:53:23] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:53:45] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:33:59] RECOVERY - SSH on dns5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:05:11] PROBLEM - SSH on 
analytics1063.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:38:11] SRE, Traffic-Icebox, WMF-General-or-Unknown, User-DannyS712, affects-Kiwix-and-openZIM: Pages whose title ends with semicolon (;) are intermittently inaccessible (likely due to ATS) - https://phabricator.wikimedia.org/T238285 (AlexisJazz) Some observations: * [[https://en.wikipedia.org/wiki/... [16:47:43] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1013.eqiad.wmnet, wdqs1004.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs1004.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1004.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled ht [16:47:43] kitech.wikimedia.org/wiki/PyBal [16:48:23] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1004.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs1004.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1004.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled ht [16:48:23] kitech.wikimedia.org/wiki/PyBal [16:50:43] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_wikidata-updateQueryServiceLag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:52:18] PROBLEM - LVS wdqs eqiad port 80/tcp - Wikidata Query Service IPv4 #page on wdqs.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [16:52:41] FYI, that went off earlier in the day but didn't hash 
page [16:56:03] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:57:07] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:57:11] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:57:17] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:01:35] PROBLEM - ats-tls HTTPS wikiworkshop.org ECDSA on cp6011 is CRITICAL: SSL CRITICAL - OCSP staple validity for wikiworkshop.org has 86304 seconds left https://wikitech.wikimedia.org/wiki/HTTPS [17:01:51] PROBLEM - ats-tls HTTPS wikiworkshop.org RSA on cp6011 is CRITICAL: SSL CRITICAL - OCSP staple validity for wikiworkshop.org has 86288 seconds left https://wikitech.wikimedia.org/wiki/HTTPS [17:04:01] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_wikidata-updateQueryServiceLag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:06:41] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:09:21] PROBLEM - Query Service HTTP Port on wdqs1006 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 410 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [17:14:41] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_wikidata-updateQueryServiceLag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:17:09] 
PROBLEM - Query Service HTTP Port on wdqs1013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 410 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [17:17:23] PROBLEM - Query Service HTTP Port on wdqs1006 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 410 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [17:22:21] PROBLEM - Query Service HTTP Port on wdqs1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [17:22:31] RECOVERY - Query Service HTTP Port on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.063 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [17:22:43] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:22:49] RECOVERY - Query Service HTTP Port on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.730 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [17:24:14] RECOVERY - LVS wdqs eqiad port 80/tcp - Wikidata Query Service IPv4 #page on wdqs.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.030 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [17:25:11] RECOVERY - Query Service HTTP Port on wdqs1005 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.018 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [17:25:15] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:25:43] RECOVERY - Query Service HTTP Port on wdqs1007 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.011 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [17:25:55] RECOVERY - PyBal 
backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:40:56] (JobUnavailable) firing: (3) Reduced availability for job gitlab in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [18:17:30] ? [18:17:31] ? [18:17:32] ? [18:17:34] ~ [18:40:07] SRE, Release-Engineering-Team, Scap, Python3-Porting: git-fat needs to be ported to Python 3 - https://phabricator.wikimedia.org/T279509 (thcipriani) [19:10:03] RECOVERY - SSH on analytics1063.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:06:07] RECOVERY - snapshot of s5 in codfw on alert1001 is OK: Last snapshot for s5 at codfw (db2101.codfw.wmnet:3315) taken on 2022-03-06 18:08:53 (807 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [21:40:56] (JobUnavailable) firing: (3) Reduced availability for job gitlab in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [22:04:28] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:08:14] PROBLEM - Check systemd state on ms-be2028 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:18:08] RECOVERY - snapshot of s4 in eqiad on alert1001 is OK: Last snapshot for s4 at eqiad (db1150.eqiad.wmnet:3314) taken on 2022-03-06 19:15:56 (1606 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting