[00:02:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T298555)', diff saved to https://phabricator.wikimedia.org/P28246 and previous config saved to /var/cache/conftool/dbconfig/20220522-000225-ladsgroup.json
[00:02:27] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[00:02:28] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[00:02:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:02:31] <stashbot>	 T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[00:02:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:02:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:06:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P28247 and previous config saved to /var/cache/conftool/dbconfig/20220522-000607-ladsgroup.json
[00:06:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:21:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T298560)', diff saved to https://phabricator.wikimedia.org/P28248 and previous config saved to /var/cache/conftool/dbconfig/20220522-002112-ladsgroup.json
[00:21:14] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[00:21:15] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[00:21:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:21:17] <stashbot>	 T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[00:21:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:21:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T298560)', diff saved to https://phabricator.wikimedia.org/P28249 and previous config saved to /var/cache/conftool/dbconfig/20220522-002120-ladsgroup.json
[00:21:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:21:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:33:07] <icinga-wm>	 PROBLEM - Disk space on an-master1002 is CRITICAL: DISK CRITICAL - free space: /srv 5938 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-master1002&var-datasource=eqiad+prometheus/ops
[00:36:43] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_legacy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:00:03] <icinga-wm>	 RECOVERY - Check systemd state on gitlab1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:06:51] <icinga-wm>	 PROBLEM - Check systemd state on gitlab1001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-config-backup-gitlab1003.wikimedia.org.service,rsync-data-backup-gitlab1003.wikimedia.org.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:07:33] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1064 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:13:21] <icinga-wm>	 PROBLEM - SSH on druid1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:15:41] <icinga-wm>	 RECOVERY - Disk space on an-master1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-master1002&var-datasource=eqiad+prometheus/ops
[01:24:38] <TheresNoTime>	 DannyS712: FYI T308927 T308943
[01:24:38] <stashbot>	 T308927: quibble-vendor-mysql-php72-selenium-docker: "cannot create directory ‘log’: Permission denied" - https://phabricator.wikimedia.org/T308927
[01:24:39] <stashbot>	 T308943: CI fails with 'This change or one of its cross-repo dependencies was unable to be automatically merged' for a lot of repos - https://phabricator.wikimedia.org/T308943
[01:26:17] <icinga-wm>	 PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:40:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:50:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:02:09] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1064 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:05:05] <icinga-wm>	 PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:13:35] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[02:13:37] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[02:13:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:13:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:14:33] <icinga-wm>	 RECOVERY - SSH on druid1006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:01:41] <icinga-wm>	 PROBLEM - Gerrit Health Check SSL Expiry on gerrit.wikimedia.org is CRITICAL: CRITICAL - Certificate gerrit.wikimedia.org expires in 6 day(s) (Sat 28 May 2022 08:33:22 PM GMT +0000). https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[03:03:57] <icinga-wm>	 RECOVERY - Gerrit Health Check SSL Expiry on gerrit.wikimedia.org is OK: OK - Certificate gerrit.wikimedia.org will expire on Wed 27 Jul 2022 08:27:52 PM GMT +0000. https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[03:06:15] <icinga-wm>	 RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:28:39] <icinga-wm>	 RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:22:43] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1142.eqiad.wmnet with reason: Maintenance
[04:22:45] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1142.eqiad.wmnet with reason: Maintenance
[04:22:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:22:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1142 (T298555)', diff saved to https://phabricator.wikimedia.org/P28250 and previous config saved to /var/cache/conftool/dbconfig/20220522-042249-ladsgroup.json
[04:22:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:22:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:22:56] <stashbot>	 T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[04:57:32] <icinga-wm>	 RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:20:14] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:36:52] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s7 #page on db1127 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1361.18 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:37:19] <marostegui>	 mmmm
[05:37:29] <marostegui>	 schema change?
[05:37:35] <marostegui>	 checking
[05:38:33] <rzl>	 👋
[05:38:37] <rzl>	 quite a day
[05:39:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1127', diff saved to https://phabricator.wikimedia.org/P28251 and previous config saved to /var/cache/conftool/dbconfig/20220522-053905-marostegui.json
[05:39:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:39:13] <marostegui>	 Depooled just in case
[05:40:00] <marostegui>	 Yes, it was a schema change
[05:40:15] <marostegui>	 The host came out of downtime earlier than expected
[05:40:24] <marostegui>	 Amir1: please adjust the downtime, this can page again
[05:40:25] <rzl>	 ahh that'll do it
[05:40:39] <rzl>	 need anything, or are you all set?
[05:40:48] <marostegui>	 rzl: it is ok, you can go back to your life
[05:40:55] <marostegui>	 thanks for showing up
[05:40:57] <rzl>	 thanks <3 have a good morning!
[05:43:24] <icinga-wm>	 PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:43:43] <marostegui>	 I am going to leave the host depooled as I don't have time to wait for it to catch up and then repool it; I might do it later today or tomorrow during work hours
[05:43:47] <marostegui>	 Amir1: ^
[05:44:42] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:47:03] <wikibugs>	 (PS1) Marostegui: db1127: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/794808
[05:48:04] <wikibugs>	 (CR) Marostegui: "recheck" [puppet] - https://gerrit.wikimedia.org/r/794808 (owner: Marostegui)
[05:48:25] <wikibugs>	 (CR) Marostegui: [C: +2] db1127: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/794808 (owner: Marostegui)
[05:49:14] <marostegui>	 I am going back to my life
[05:51:00] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:52:30] <marostegui>	 mmm I think it wasn't the schema change but some sort of storage problem, the raid has many errors
[05:54:45] <marostegui>	 Created this: https://phabricator.wikimedia.org/T308965
[06:11:08] <wikibugs>	 SRE, Bengali-Sites, User-Urbanecm, Wiki-Setup (Create): Create a new wiki for Wikimedia Bangladesh - https://phabricator.wikimedia.org/T33096 (Ahmad_Kanik)
[06:29:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance elastic1080-production-search-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[06:36:55] <wikibugs>	 (PS1) KartikMistry: Update cxserver to 2022-05-22-062659-production [deployment-charts] - https://gerrit.wikimedia.org/r/794890 (https://phabricator.wikimedia.org/T290847)
[06:42:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T298555)', diff saved to https://phabricator.wikimedia.org/P28252 and previous config saved to /var/cache/conftool/dbconfig/20220522-064232-ladsgroup.json
[06:42:34] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1143.eqiad.wmnet with reason: Maintenance
[06:42:36] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1143.eqiad.wmnet with reason: Maintenance
[06:42:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:42:40] <stashbot>	 T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[06:42:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1143 (T298555)', diff saved to https://phabricator.wikimedia.org/P28253 and previous config saved to /var/cache/conftool/dbconfig/20220522-064240-ladsgroup.json
[06:42:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:42:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:42:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220522T0700)
[07:02:31] <icinga-wm>	 PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:08:23] <Amir1>	 marostegui: morning. Let me see. I set the downtime to 16 hours last time
[07:09:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance elastic1080-production-search-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[07:11:38] <Amir1>	 The schema change on it finished yesterday at 6 am from what I'm seeing
[07:16:10] <marostegui>	 Amir1: see above, it is storage related 
[07:16:51] <Amir1>	 I know I just wanted to make sure I didn't mess up anything on top
[07:21:08] <Amir1>	 Thankfully it seems I didn't 😁
[07:42:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T298555)', diff saved to https://phabricator.wikimedia.org/P28254 and previous config saved to /var/cache/conftool/dbconfig/20220522-074255-ladsgroup.json
[07:42:57] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1148.eqiad.wmnet with reason: Maintenance
[07:42:58] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1148.eqiad.wmnet with reason: Maintenance
[07:43:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:43:02] <stashbot>	 T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[07:43:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1148 (T298555)', diff saved to https://phabricator.wikimedia.org/P28255 and previous config saved to /var/cache/conftool/dbconfig/20220522-074303-ladsgroup.json
[07:43:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:43:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:43:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:29] <icinga-wm>	 RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:03:41] <icinga-wm>	 RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:11:45] <icinga-wm>	 PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:21:27] <wikibugs>	 (PS3) KartikMistry: Enable ContentTranslation as default for cs, el, he, ko and tr WPs [mediawiki-config] - https://gerrit.wikimedia.org/r/793444 (https://phabricator.wikimedia.org/T298239)
[08:40:25] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1167.eqiad.wmnet with reason: Maintenance
[08:40:27] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1167.eqiad.wmnet with reason: Maintenance
[08:40:28] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[08:40:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:31] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[08:40:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1167 (T303603)', diff saved to https://phabricator.wikimedia.org/P28256 and previous config saved to /var/cache/conftool/dbconfig/20220522-084036-ladsgroup.json
[08:40:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:47] <stashbot>	 T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603
[08:50:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T298560)', diff saved to https://phabricator.wikimedia.org/P28257 and previous config saved to /var/cache/conftool/dbconfig/20220522-085056-ladsgroup.json
[08:51:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:02] <stashbot>	 T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[09:06:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P28258 and previous config saved to /var/cache/conftool/dbconfig/20220522-090601-ladsgroup.json
[09:06:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T303603)', diff saved to https://phabricator.wikimedia.org/P28259 and previous config saved to /var/cache/conftool/dbconfig/20220522-090811-ladsgroup.json
[09:08:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:19] <stashbot>	 T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603
[09:21:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P28260 and previous config saved to /var/cache/conftool/dbconfig/20220522-092106-ladsgroup.json
[09:21:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P28261 and previous config saved to /var/cache/conftool/dbconfig/20220522-092317-ladsgroup.json
[09:23:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:29] <wikibugs>	 (PS1) Majavah: Separate metricsinfra nodes from prometheus_nodes on cloud [puppet] - https://gerrit.wikimedia.org/r/795143
[09:34:08] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35466/console" [puppet] - https://gerrit.wikimedia.org/r/795143 (owner: Majavah)
[09:35:06] <wikibugs>	 (PS2) Majavah: Separate metricsinfra nodes from prometheus_nodes on cloud [puppet] - https://gerrit.wikimedia.org/r/795143
[09:36:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T298560)', diff saved to https://phabricator.wikimedia.org/P28262 and previous config saved to /var/cache/conftool/dbconfig/20220522-093611-ladsgroup.json
[09:36:13] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35467/console" [puppet] - https://gerrit.wikimedia.org/r/795143 (owner: Majavah)
[09:36:13] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[09:36:14] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[09:36:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:36:17] <stashbot>	 T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[09:36:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T298560)', diff saved to https://phabricator.wikimedia.org/P28263 and previous config saved to /var/cache/conftool/dbconfig/20220522-093619-ladsgroup.json
[09:36:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:36:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:36:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:37:03] <icinga-wm>	 PROBLEM - Gerrit Health Check SSL Expiry on gerrit.wikimedia.org is CRITICAL: CRITICAL - Certificate gerrit.wikimedia.org expires in 6 day(s) (Sat 28 May 2022 08:33:22 PM GMT +0000). https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[09:38:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P28264 and previous config saved to /var/cache/conftool/dbconfig/20220522-093822-ladsgroup.json
[09:38:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:23] <icinga-wm>	 RECOVERY - Gerrit Health Check SSL Expiry on gerrit.wikimedia.org is OK: OK - Certificate gerrit.wikimedia.org will expire on Wed 27 Jul 2022 08:27:52 PM GMT +0000. https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[09:51:00] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:53:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T303603)', diff saved to https://phabricator.wikimedia.org/P28265 and previous config saved to /var/cache/conftool/dbconfig/20220522-095327-ladsgroup.json
[09:53:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:34] <stashbot>	 T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603
[09:54:49] <icinga-wm>	 PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:04:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T298555)', diff saved to https://phabricator.wikimedia.org/P28266 and previous config saved to /var/cache/conftool/dbconfig/20220522-100429-ladsgroup.json
[10:04:30] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1149.eqiad.wmnet with reason: Maintenance
[10:04:32] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1149.eqiad.wmnet with reason: Maintenance
[10:04:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:34] <stashbot>	 T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[10:04:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1149 (T298555)', diff saved to https://phabricator.wikimedia.org/P28267 and previous config saved to /var/cache/conftool/dbconfig/20220522-100436-ladsgroup.json
[10:04:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:10:56] <wikibugs>	 (PS4) Majavah: metricsinfra: Use prometheus-configurator [puppet] - https://gerrit.wikimedia.org/r/763664 (https://phabricator.wikimedia.org/T286299)
[10:10:58] <wikibugs>	 (PS1) Majavah: P:metricsinfra::alertmanager: proxy access for trusted projects [puppet] - https://gerrit.wikimedia.org/r/795192 (https://phabricator.wikimedia.org/T304716)
[10:14:20] <wikibugs>	 (CR) jerkins-bot: [V: -1] P:metricsinfra::alertmanager: proxy access for trusted projects [puppet] - https://gerrit.wikimedia.org/r/795192 (https://phabricator.wikimedia.org/T304716) (owner: Majavah)
[10:14:37] <wikibugs>	 (CR) jerkins-bot: [V: -1] metricsinfra: Use prometheus-configurator [puppet] - https://gerrit.wikimedia.org/r/763664 (https://phabricator.wikimedia.org/T286299) (owner: Majavah)
[10:19:09] <wikibugs>	 (PS2) Majavah: P:metricsinfra::alertmanager: proxy access for trusted projects [puppet] - https://gerrit.wikimedia.org/r/795192 (https://phabricator.wikimedia.org/T304716)
[10:19:10] <wikibugs>	 (PS5) Majavah: metricsinfra: Use prometheus-configurator [puppet] - https://gerrit.wikimedia.org/r/763664 (https://phabricator.wikimedia.org/T286299)
[10:26:11] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:26:51] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:28:49] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:30:39] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48108 bytes in 0.156 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:30:57] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 25 Jun 2022 07:55:09 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:31:19] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.291 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:41:33] <icinga-wm>	 PROBLEM - Gerrit Health Check SSL Expiry on gerrit.wikimedia.org is CRITICAL: CRITICAL - Certificate gerrit.wikimedia.org expires in 6 day(s) (Sat 28 May 2022 08:33:22 PM GMT +0000). https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[10:43:51] <icinga-wm>	 RECOVERY - Gerrit Health Check SSL Expiry on gerrit.wikimedia.org is OK: OK - Certificate gerrit.wikimedia.org will expire on Wed 27 Jul 2022 08:27:52 PM GMT +0000. https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[10:56:01] <icinga-wm>	 RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:19:13] <icinga-wm>	 PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:50:08] <wikibugs>	 (PS1) Majavah: openstack::designate: enable TLS encryption for rabbitmq [puppet] - https://gerrit.wikimedia.org/r/795356 (https://phabricator.wikimedia.org/T297268)
[11:50:10] <wikibugs>	 (PS1) Majavah: openstack::neutron: enable TLS encryption for rabbitmq [puppet] - https://gerrit.wikimedia.org/r/795357 (https://phabricator.wikimedia.org/T297268)
[11:50:12] <wikibugs>	 (PS1) Majavah: openstack::nova: enable TLS encryption for rabbitmq [puppet] - https://gerrit.wikimedia.org/r/795358 (https://phabricator.wikimedia.org/T297268)
[11:52:46] <wikibugs>	 (PS1) Majavah: openstack::trove: enable rabbitmq tls for api [puppet] - https://gerrit.wikimedia.org/r/795361 (https://phabricator.wikimedia.org/T297268)
[11:55:07] <wikibugs>	 (PS1) Majavah: cloudweb2002-dev is not behind LVS [puppet] - https://gerrit.wikimedia.org/r/795365
[11:56:23] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35468/console" [puppet] - https://gerrit.wikimedia.org/r/795365 (owner: Majavah)
[11:57:36] <wikibugs>	 (PS1) Majavah: Revert "Horizon: include openstack bpos on cloudweb hosts" [puppet] - https://gerrit.wikimedia.org/r/795249
[11:57:45] <icinga-wm>	 PROBLEM - SSH on wtp1046.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:58:46] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35469/console" [puppet] - https://gerrit.wikimedia.org/r/795249 (owner: Majavah)
[12:16:51] <icinga-wm>	 RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:20:29] <icinga-wm>	 RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:24:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T298555)', diff saved to https://phabricator.wikimedia.org/P28269 and previous config saved to /var/cache/conftool/dbconfig/20220522-122402-ladsgroup.json
[12:24:03] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[12:24:05] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[12:24:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:24:09] <stashbot>	 T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[12:24:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3314 (T298555)', diff saved to https://phabricator.wikimedia.org/P28270 and previous config saved to /var/cache/conftool/dbconfig/20220522-122410-ladsgroup.json
[12:24:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:24:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:24:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:24:58] <wikibugs>	 (PS1) Majavah: apt::repository: use signed-by instead of apt-key [puppet] - https://gerrit.wikimedia.org/r/795380
[12:26:15] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35470/console" [puppet] - https://gerrit.wikimedia.org/r/795380 (owner: Majavah)
[12:36:51] <icinga-wm>	 PROBLEM - Disk space on an-master1002 is CRITICAL: DISK CRITICAL - free space: /srv 3679 MB (2% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-master1002&var-datasource=eqiad+prometheus/ops
[12:58:09] <icinga-wm>	 RECOVERY - Disk space on an-master1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-master1002&var-datasource=eqiad+prometheus/ops
[12:58:55] <icinga-wm>	 RECOVERY - SSH on wtp1046.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:12:49] * Krinkle testing on mwdebug1002
[13:14:55] <wikibugs>	 (CR) Krinkle: [C: +2] MWConfigCacheGenerator: Move siteFromDB() deeper down the stack [mediawiki-config] - https://gerrit.wikimedia.org/r/749763 (https://phabricator.wikimedia.org/T169821) (owner: Krinkle)
[13:15:39] <wikibugs>	 (Merged) jenkins-bot: MWConfigCacheGenerator: Move siteFromDB() deeper down the stack [mediawiki-config] - https://gerrit.wikimedia.org/r/749763 (https://phabricator.wikimedia.org/T169821) (owner: Krinkle)
[13:17:16] <logmsgbot>	 !log krinkle@deploy1002 scap failed: average error rate on 7/8 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org for details)
[13:17:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:18:50] <logmsgbot>	 !log krinkle@deploy1002 Scap failed!: 7/8 canaries failed their endpoint checks(https://en.wikipedia.org).  WARNING: canaries have not been rolled back.
[13:18:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:19:12] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:19:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:20:15] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:20:16] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:20:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:20:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:21:22] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:21:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:49] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:25:37] <logmsgbot>	 !log krinkle@deploy1002 Synchronized wmf-config/CommonSettings.php: I97878f8e6 (duration: 00m 50s)
[13:25:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:28:58] <logmsgbot>	 !log krinkle@deploy1002 Synchronized multiversion/: I3759179dba75a9419 (duration: 00m 53s)
[13:29:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:29:29] <icinga-wm>	 PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:36:28] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:36:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:40:42] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:40:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:40:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:40:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:44:42] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:44:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:47:27] <wikibugs>	 (PS11) Krinkle: Update CommonSettings to use array return from IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793871 (https://phabricator.wikimedia.org/T308932) (owner: Ladsgroup)
[13:51:00] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:55:45] <wikibugs>	 (CR) Krinkle: [C: +2] Make IS.php return an array [mediawiki-config] - https://gerrit.wikimedia.org/r/793869 (https://phabricator.wikimedia.org/T308932) (owner: Ladsgroup)
[13:56:33] <wikibugs>	 (Merged) jenkins-bot: Make IS.php return an array [mediawiki-config] - https://gerrit.wikimedia.org/r/793869 (https://phabricator.wikimedia.org/T308932) (owner: Ladsgroup)
[13:59:48] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:59:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:02:55] <logmsgbot>	 !log krinkle@deploy1002 Synchronized wmf-config/InitialiseSettings.php: I31b1bfb1808b9523 (duration: 00m 52s)
[14:02:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:03:55] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[14:03:56] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[14:03:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:04:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:04:04] <wikibugs>	 (CR) Krinkle: [C: +2] Update CommonSettings to use array return from IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793871 (https://phabricator.wikimedia.org/T308932) (owner: Ladsgroup)
[14:04:06] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Revert "Horizon: include openstack bpos on cloudweb hosts" [puppet] - https://gerrit.wikimedia.org/r/795249 (owner: Majavah)
[14:04:15] <wikibugs>	 (CR) jerkins-bot: [V: -1] Update CommonSettings to use array return from IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793871 (https://phabricator.wikimedia.org/T308932) (owner: Ladsgroup)
[14:05:22] <wikibugs>	 (PS12) Krinkle: Update CommonSettings to use array return from IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793871 (https://phabricator.wikimedia.org/T308932) (owner: Ladsgroup)
[14:05:29] <wikibugs>	 (CR) Krinkle: [C: +2] Update CommonSettings to use array return from IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793871 (https://phabricator.wikimedia.org/T308932) (owner: Ladsgroup)
[14:06:52] <wikibugs>	 (Merged) jenkins-bot: Update CommonSettings to use array return from IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793871 (https://phabricator.wikimedia.org/T308932) (owner: Ladsgroup)
[14:07:55] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[14:07:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:11:52] <logmsgbot>	 !log krinkle@deploy1002 Synchronized multiversion/: Ia0a6d4794faaafcb (1/2) (duration: 00m 50s)
[14:11:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:12:58] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[14:13:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:14:03] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[14:14:04] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[14:14:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:14:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:14:32] <logmsgbot>	 !log krinkle@deploy1002 scap failed: average error rate on 3/8 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org for details)
[14:14:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:15:01] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[14:15:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:15:53] <wikibugs>	 (CR) Andrew Bogott: [C: +2] "Correct, wikitech-static is in ORD now" [dns] - https://gerrit.wikimedia.org/r/793729 (https://phabricator.wikimedia.org/T155761) (owner: Volans)
[14:18:03] <Amir1>	 we will see an increase in 500x
[14:18:06] <Amir1>	 that's fine
[14:18:10] <Amir1>	 it'll recover
[14:18:23] <logmsgbot>	 !log krinkle@deploy1002 Synchronized wmf-config/: Ia0a6d4794faaafcb (2/2) (duration: 00m 42s)
[14:18:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:19:23] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:23:02] <logmsgbot>	 !log krinkle@deploy1002 Synchronized docroot/noc/: Ia0a6d4794faaafc (duration: 00m 50s)
[14:23:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:25:23] <icinga-wm>	 RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[14:27:55] <logmsgbot>	 !log krinkle@deploy1002 Synchronized src/: Ia0a6d4794faaafc (duration: 00m 50s)
[14:27:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:34:17] <wikibugs>	 (Restored) Winston Sung: Rearrange zh-related fallbacks and zh/zh-* translations, aliases in mediawiki/core [core] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/788416 (https://phabricator.wikimedia.org/T286291) (owner: Winston Sung)
[14:34:24] <wikibugs>	 (Restored) Winston Sung: Rearrange zh-related fallbacks and zh/zh-* translations, aliases in mediawiki/core [core] (wmf/1.39.0-wmf.9) - https://gerrit.wikimedia.org/r/788606 (https://phabricator.wikimedia.org/T286291) (owner: Winston Sung)
[14:34:35] <wikibugs>	 (PS2) Winston Sung: Rearrange zh-related fallbacks and zh/zh-* translations, aliases in mediawiki/core [core] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/788416
[14:34:43] <wikibugs>	 (Abandoned) Winston Sung: Rearrange zh-related fallbacks and zh/zh-* translations, aliases in mediawiki/core [core] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/788416 (owner: Winston Sung)
[14:34:53] <wikibugs>	 (PS4) Winston Sung: Rearrange zh-related fallbacks and zh/zh-* translations, aliases in mediawiki/core [core] (wmf/1.39.0-wmf.9) - https://gerrit.wikimedia.org/r/788606
[14:34:58] <wikibugs>	 (Abandoned) Winston Sung: Rearrange zh-related fallbacks and zh/zh-* translations, aliases in mediawiki/core [core] (wmf/1.39.0-wmf.9) - https://gerrit.wikimedia.org/r/788606 (owner: Winston Sung)
[14:37:19] <wikibugs>	 (Restored) Winston Sung: Temporarily disable yue language fallback tests [extensions/Wikibase] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/788417 (https://phabricator.wikimedia.org/T299377) (owner: Winston Sung)
[14:37:26] <wikibugs>	 (PS2) Winston Sung: Temporarily disable yue language fallback tests [extensions/Wikibase] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/788417
[14:37:30] <wikibugs>	 (Abandoned) Winston Sung: Temporarily disable yue language fallback tests [extensions/Wikibase] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/788417 (owner: Winston Sung)
[14:37:33] <wikibugs>	 (Restored) Winston Sung: Temporarily disable yue language fallback tests [extensions/Wikibase] (wmf/1.39.0-wmf.9) - https://gerrit.wikimedia.org/r/788608 (https://phabricator.wikimedia.org/T299377) (owner: Winston Sung)
[14:37:40] <wikibugs>	 (PS2) Winston Sung: Temporarily disable yue language fallback tests [extensions/Wikibase] (wmf/1.39.0-wmf.9) - https://gerrit.wikimedia.org/r/788608
[14:37:52] <wikibugs>	 (Restored) Winston Sung: Revert "Temporarily disable yue language fallback tests" [extensions/Wikibase] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/788418 (https://phabricator.wikimedia.org/T299377) (owner: Winston Sung)
[14:37:59] <wikibugs>	 (PS3) Winston Sung: Revert "Temporarily disable yue language fallback tests" [extensions/Wikibase] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/788418
[14:38:04] <wikibugs>	 (Abandoned) Winston Sung: Revert "Temporarily disable yue language fallback tests" [extensions/Wikibase] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/788418 (owner: Winston Sung)
[14:38:13] <wikibugs>	 (Restored) Winston Sung: Revert "Temporarily disable yue language fallback tests" [extensions/Wikibase] (wmf/1.39.0-wmf.9) - https://gerrit.wikimedia.org/r/788610 (https://phabricator.wikimedia.org/T299377) (owner: Winston Sung)
[14:38:26] <wikibugs>	 (PS3) Winston Sung: Revert "Temporarily disable yue language fallback tests" [extensions/Wikibase] (wmf/1.39.0-wmf.9) - https://gerrit.wikimedia.org/r/788610
[14:38:32] <wikibugs>	 (Abandoned) Winston Sung: Revert "Temporarily disable yue language fallback tests" [extensions/Wikibase] (wmf/1.39.0-wmf.9) - https://gerrit.wikimedia.org/r/788610 (owner: Winston Sung)
[14:46:49] <wikibugs>	 (CR) Winston Sung: "The Depends-on has been abandoned, please abandon this change." [mediawiki-config] - https://gerrit.wikimedia.org/r/775423 (https://phabricator.wikimedia.org/T273578) (owner: Func)
[14:46:57] <wikibugs>	 (CR) Andrew Bogott: [C: +2] openstack,admin_script: ran black and isort [puppet] - https://gerrit.wikimedia.org/r/789117 (https://phabricator.wikimedia.org/T305909) (owner: David Caro)
[14:48:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T298555)', diff saved to https://phabricator.wikimedia.org/P28272 and previous config saved to /var/cache/conftool/dbconfig/20220522-144847-ladsgroup.json
[14:48:49] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1138.eqiad.wmnet with reason: Maintenance
[14:48:51] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1138.eqiad.wmnet with reason: Maintenance
[14:48:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:48:54] <stashbot>	 T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[14:48:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1138 (T298555)', diff saved to https://phabricator.wikimedia.org/P28273 and previous config saved to /var/cache/conftool/dbconfig/20220522-144855-ladsgroup.json
[14:48:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:49:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:49:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:50:24] <wikibugs>	 (CR) Andrew Bogott: [C: +1] cloudvirt-libvirt-stats: Avoid printing to stdout [puppet] - https://gerrit.wikimedia.org/r/790388 (owner: David Caro)
[14:54:52] <wikibugs>	 (PS1) Stang: zhwiki: Enable RCPatrol [mediawiki-config] - https://gerrit.wikimedia.org/r/795526 (https://phabricator.wikimedia.org/T308976)
[15:35:47] <wikibugs>	 (Abandoned) Func: Use variants fallback to define logos for zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/775423 (https://phabricator.wikimedia.org/T273578) (owner: Func)
[16:11:49] <icinga-wm>	 PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:48:23] <wikibugs>	 (Abandoned) Winston Sung: Temporarily disable yue language fallback tests [extensions/Wikibase] (wmf/1.39.0-wmf.9) - https://gerrit.wikimedia.org/r/788608 (owner: Winston Sung)
[17:01:23] <wikibugs>	 (CR) Krinkle: [C: +1] "Through machine translation, I understand the consensus." [mediawiki-config] - https://gerrit.wikimedia.org/r/795526 (https://phabricator.wikimedia.org/T308976) (owner: Stang)
[17:14:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T298555)', diff saved to https://phabricator.wikimedia.org/P28274 and previous config saved to /var/cache/conftool/dbconfig/20220522-171444-ladsgroup.json
[17:14:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:14:49] <stashbot>	 T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[17:51:00] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:05:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T298560)', diff saved to https://phabricator.wikimedia.org/P28275 and previous config saved to /var/cache/conftool/dbconfig/20220522-180506-ladsgroup.json
[18:05:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:05:13] <stashbot>	 T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[18:20:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P28276 and previous config saved to /var/cache/conftool/dbconfig/20220522-182011-ladsgroup.json
[18:20:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:35:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P28277 and previous config saved to /var/cache/conftool/dbconfig/20220522-183516-ladsgroup.json
[18:35:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:48:32] <zabe>	 seems like wikibugs died an hour ago and is still not back
[18:50:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T298560)', diff saved to https://phabricator.wikimedia.org/P28278 and previous config saved to /var/cache/conftool/dbconfig/20220522-185021-ladsgroup.json
[18:50:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:50:28] <stashbot>	 T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[18:59:37] <icinga-wm>	 PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:46:52] <icinga-wm>	 PROBLEM - exim queue #page on mx1001 is CRITICAL: CRITICAL: 4040 mails in exim queue. https://wikitech.wikimedia.org/wiki/Exim https://grafana.wikimedia.org/d/000000451/mail
[19:47:55] <_joe_>	 o/
[19:48:33] * jbond here, looking
[19:48:56] <mutante>	 here. so there was recent change to switch mx alerting. looking if that was merged
[19:49:13] <mutante>	 ah, no. not merged yet
[19:49:39] <rzl>	 👋 
[19:49:41] <jbond>	  2399   118MB     43h      0m  tools.wmflabs.org
[19:54:14] <zabe>	 is it libup bot again?
[19:55:27] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:56:37] <mutante>	 zabe: yea
[19:57:37] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.297 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:58:49] <icinga-wm>	 ACKNOWLEDGEMENT - exim queue #page on mx1001 is CRITICAL: CRITICAL: 4040 mails in exim queue. daniel_zahn https://phabricator.wikimedia.org/T306295 https://wikitech.wikimedia.org/wiki/Exim https://grafana.wikimedia.org/d/000000451/mail
[20:00:53] <icinga-wm>	 RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:02:34] * Krinkle testing on mwdebug1002
[20:18:10] <icinga-wm>	 RECOVERY - exim queue #page on mx1001 is OK: OK: Less than 2000 mails in exim queue. https://wikitech.wikimedia.org/wiki/Exim https://grafana.wikimedia.org/d/000000451/mail
[20:29:15] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:29:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:31:34] <logmsgbot>	 !log krinkle@deploy1002 Synchronized src/XhguiSaverPdo.php: I3882be35572 (duration: 00m 50s)
[20:31:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:31:50] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:31:51] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:31:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:31:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:32:42] <logmsgbot>	 !log krinkle@deploy1002 Synchronized wmf-config/profiler.php: I3882be35572 (duration: 00m 51s)
[20:32:43] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:32:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:32:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:34:02] <logmsgbot>	 !log krinkle@deploy1002 Synchronized lib/: I3882be35572 (duration: 00m 50s)
[20:34:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:41:34] <logmsgbot>	 !log krinkle@deploy1002 Synchronized src/Profiler.php: I14c5a9aa39 (duration: 00m 49s)
[20:41:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:42:36] <logmsgbot>	 !log krinkle@deploy1002 Synchronized wmf-config/: I14c5a9aa39 (duration: 00m 50s)
[20:42:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:42:50] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:42:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:43:37] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:43:38] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:43:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:43:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:46:08] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:46:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:53:24] * Krinkle done testing
[21:51:00] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:20:59] <icinga-wm>	 RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:30:19] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:46:09] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 46.87 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[23:46:47] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 56.37 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[23:48:29] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[23:49:07] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: (C)60 le (W)70 le 81.26 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[23:52:13] <icinga-wm>	 PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:55:53] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state