[01:02:47] <icinga-wm>	 PROBLEM - snapshot of s5 in codfw on alert1001 is CRITICAL: snapshot for s5 at codfw taken more than 3 days ago: Most recent backup 2022-03-03 00:25:48 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[01:12:43] <icinga-wm>	 PROBLEM - SSH on dns5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:40:30] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job gitlab in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[02:33:19] <icinga-wm>	 PROBLEM - Query Service HTTP Port on wdqs1006 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[04:02:31] <icinga-wm>	 RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:10:41] <icinga-wm>	 PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: debian-weekly-rebuild.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:37:53] <icinga-wm>	 PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 104 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:40:56] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job gitlab in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[06:21:11] <icinga-wm>	 RECOVERY - SSH on dns5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220306T0800)
[08:34:45] <icinga-wm>	 PROBLEM - Query Service HTTP Port on wdqs1007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[09:40:56] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job gitlab in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[10:46:59] <icinga-wm>	 RECOVERY - Backup freshness on backup1001 is OK: Fresh: 105 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[11:16:42] <wikibugs>	 (PS1) Majavah: openstack: fix remaining http keystone urls [puppet] - https://gerrit.wikimedia.org/r/768293 (https://phabricator.wikimedia.org/T267194)
[11:28:41] <icinga-wm>	 PROBLEM - SSH on dns5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:28:57] <wikibugs>	 (PS1) Majavah: P:wmcs::prometheus: update blackbox urls [puppet] - https://gerrit.wikimedia.org/r/768294
[11:30:07] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34090/console" [puppet] - https://gerrit.wikimedia.org/r/768294 (owner: Majavah)
[12:09:21] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:11:23] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:40:56] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job gitlab in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[13:50:19] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1013.eqiad.wmnet, wdqs1004.eqiad.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs1004.eqiad.wmnet, wdqs1013.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1004.eqiad.wmnet, wdqs1013.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[13:50:37] <icinga-wm>	 PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_wikidata-updateQueryServiceLag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:50:57] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1004.eqiad.wmnet, wdqs1013.eqiad.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs1004.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1004.eqiad.wmnet, wdqs1013.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[13:53:07] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[13:53:23] <icinga-wm>	 RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:53:45] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:33:59] <icinga-wm>	 RECOVERY - SSH on dns5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:05:11] <icinga-wm>	 PROBLEM - SSH on analytics1063.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:38:11] <wikibugs>	 SRE, Traffic-Icebox, WMF-General-or-Unknown, User-DannyS712, affects-Kiwix-and-openZIM: Pages whose title ends with semicolon (;) are intermittently inaccessible (likely due to ATS) - https://phabricator.wikimedia.org/T238285 (AlexisJazz) Some observations:  * [[https://en.wikipedia.org/wiki/...
[16:47:43] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1013.eqiad.wmnet, wdqs1004.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs1004.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1004.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[16:48:23] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1004.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs1004.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1004.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[16:50:43] <icinga-wm>	 PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_wikidata-updateQueryServiceLag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:52:18] <icinga-wm>	 PROBLEM - LVS wdqs eqiad port 80/tcp - Wikidata Query Service IPv4 #page on wdqs.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[16:52:41] <RhinosF1>	 FYI, that went off earlier in the day but didn't hash page
[16:56:03] <icinga-wm>	 RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:57:07] <icinga-wm>	 RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:57:11] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:57:17] <icinga-wm>	 PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:01:35] <icinga-wm>	 PROBLEM - ats-tls HTTPS wikiworkshop.org ECDSA on cp6011 is CRITICAL: SSL CRITICAL - OCSP staple validity for wikiworkshop.org has 86304 seconds left https://wikitech.wikimedia.org/wiki/HTTPS
[17:01:51] <icinga-wm>	 PROBLEM - ats-tls HTTPS wikiworkshop.org RSA on cp6011 is CRITICAL: SSL CRITICAL - OCSP staple validity for wikiworkshop.org has 86288 seconds left https://wikitech.wikimedia.org/wiki/HTTPS
[17:04:01] <icinga-wm>	 PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_wikidata-updateQueryServiceLag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:06:41] <icinga-wm>	 RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:09:21] <icinga-wm>	 PROBLEM - Query Service HTTP Port on wdqs1006 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 410 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[17:14:41] <icinga-wm>	 PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_wikidata-updateQueryServiceLag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:17:09] <icinga-wm>	 PROBLEM - Query Service HTTP Port on wdqs1013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 410 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[17:17:23] <icinga-wm>	 PROBLEM - Query Service HTTP Port on wdqs1006 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 410 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[17:22:21] <icinga-wm>	 PROBLEM - Query Service HTTP Port on wdqs1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[17:22:31] <icinga-wm>	 RECOVERY - Query Service HTTP Port on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.063 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[17:22:43] <icinga-wm>	 RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:22:49] <icinga-wm>	 RECOVERY - Query Service HTTP Port on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.730 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[17:24:14] <icinga-wm>	 RECOVERY - LVS wdqs eqiad port 80/tcp - Wikidata Query Service IPv4 #page on wdqs.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.030 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[17:25:11] <icinga-wm>	 RECOVERY - Query Service HTTP Port on wdqs1005 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.018 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[17:25:15] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[17:25:43] <icinga-wm>	 RECOVERY - Query Service HTTP Port on wdqs1007 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.011 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[17:25:55] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[17:40:56] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job gitlab in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[18:17:30] <jbond>	 ?
[18:17:31] <jbond>	 ?
[18:17:32] <jbond>	 ?
[18:17:34] <jbond>	 ~
[18:40:07] <wikibugs>	 SRE, Release-Engineering-Team, Scap, Python3-Porting: git-fat needs to be ported to Python 3 - https://phabricator.wikimedia.org/T279509 (thcipriani)
[19:10:03] <icinga-wm>	 RECOVERY - SSH on analytics1063.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:06:07] <icinga-wm>	 RECOVERY - snapshot of s5 in codfw on alert1001 is OK: Last snapshot for s5 at codfw (db2101.codfw.wmnet:3315) taken on 2022-03-06 18:08:53 (807 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[21:40:56] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job gitlab in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[22:04:28] <icinga-wm>	 RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:08:14] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2028 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:18:08] <icinga-wm>	 RECOVERY - snapshot of s4 in eqiad on alert1001 is OK: Last snapshot for s4 at eqiad (db1150.eqiad.wmnet:3314) taken on 2022-03-06 19:15:56 (1606 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting