[00:12:03] <icinga-wm>	 PROBLEM - Check systemd state on an-master1002 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-namenode-backup-fetchimage.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:46:17] <icinga-wm>	 PROBLEM - Disk space on gitlab1001 is CRITICAL: DISK CRITICAL - free space: /mnt/gitlab-backup 0 MB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=gitlab1001&var-datasource=eqiad+prometheus/ops
[00:53:27] <icinga-wm>	 RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:17:21] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:18:17] <icinga-wm>	 PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:26:43] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:40:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:50:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:19:27] <icinga-wm>	 RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:36:23] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:36:47] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:38:27] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.310 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:38:51] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48107 bytes in 0.150 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:18:05] <icinga-wm>	 PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:39:59] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.ipmi-password-reset
[04:39:59] <logmsgbot>	 !log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.ipmi-password-reset (exit_code=99)
[04:40:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:40:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:40:19] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.ipmi-password-reset
[04:40:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:40:33] <logmsgbot>	 !log marostegui@cumin1001 Updating IPMI password on 1 hosts - marostegui@cumin1001
[04:40:35] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.ipmi-password-reset (exit_code=0)
[04:40:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:40:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:51:02] <marostegui>	 akosiaris: ^ enjoy
[04:53:58] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1143.eqiad.wmnet with reason: Maintenance
[04:53:59] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1143.eqiad.wmnet with reason: Maintenance
[04:54:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:54:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1143 (T298555)', diff saved to https://phabricator.wikimedia.org/P28280 and previous config saved to /var/cache/conftool/dbconfig/20220523-045404-ladsgroup.json
[04:54:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:54:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:54:12] <stashbot>	 T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[04:55:42] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[04:55:44] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[04:55:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:55:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T298555)', diff saved to https://phabricator.wikimedia.org/P28281 and previous config saved to /var/cache/conftool/dbconfig/20220523-045548-ladsgroup.json
[04:55:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:55:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:58:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1118 reimage to bullseye', diff saved to https://phabricator.wikimedia.org/P28282 and previous config saved to /var/cache/conftool/dbconfig/20220523-045850-marostegui.json
[04:58:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:03:35] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance
[05:03:37] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance
[05:03:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:03:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1131 (T303603)', diff saved to https://phabricator.wikimedia.org/P28283 and previous config saved to /var/cache/conftool/dbconfig/20220523-050341-ladsgroup.json
[05:03:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:03:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:03:46] <stashbot>	 T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603
[05:06:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T303603)', diff saved to https://phabricator.wikimedia.org/P28284 and previous config saved to /var/cache/conftool/dbconfig/20220523-050624-ladsgroup.json
[05:06:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:06:45] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1118.eqiad.wmnet with OS bullseye
[05:06:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:09:27] * kart_ updating cxserver
[05:12:29] <icinga-wm>	 PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:15:17] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1118.eqiad.wmnet with reason: host reimage
[05:15:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:18:02] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1118.eqiad.wmnet with reason: host reimage
[05:18:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:20:58] <taavi>	 jouncebot: nowandnext
[05:20:58] <jouncebot>	 For the next 1 hour(s) and 39 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220522T0700)
[05:20:59] <jouncebot>	 In 1 hour(s) and 39 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220523T0700)
[05:21:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P28285 and previous config saved to /var/cache/conftool/dbconfig/20220523-052130-ladsgroup.json
[05:21:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:24:08] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply
[05:24:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:24:43] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[05:24:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:25:38] <logmsgbot>	 !log taavi@deploy1002 Synchronized php-1.39.0-wmf.12/extensions/WikimediaMaintenance/fixT308895BrokenRenames.php: Backport: [[gerrit:793800|Add a script to fix T308895 renames (T308895)]] (duration: 00m 51s)
[05:25:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:25:43] <stashbot>	 T308895: GlobalRename not renaming some accounts - https://phabricator.wikimedia.org/T308895
[05:26:48] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[05:26:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:27:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[05:27:48] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[05:27:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:27:53] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[05:27:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:27:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:28:41] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[05:28:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:28:52] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[05:28:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:30:49] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[05:30:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:31:45] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[05:31:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:33:01] <kart_>	 !log Updated cxserver to 2022-05-22-062659-production (T290847)
[05:33:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:33:06] <stashbot>	 T290847: Generate template parameter alignments for languages of interest to Section Translation - https://phabricator.wikimedia.org/T290847
[05:35:01] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1118.eqiad.wmnet with OS bullseye
[05:35:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:35:43] <icinga-wm>	 RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:36:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P28286 and previous config saved to /var/cache/conftool/dbconfig/20220523-053635-ladsgroup.json
[05:36:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:43:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 1%: After reimage', diff saved to https://phabricator.wikimedia.org/P28287 and previous config saved to /var/cache/conftool/dbconfig/20220523-054311-root.json
[05:43:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:51:00] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:51:02] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1172.eqiad.wmnet with OS bullseye
[05:51:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:51:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T303603)', diff saved to https://phabricator.wikimedia.org/P28288 and previous config saved to /var/cache/conftool/dbconfig/20220523-055140-ladsgroup.json
[05:51:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:51:45] <stashbot>	 T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603
[05:53:54] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1081 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:57:10] <Amir1>	 jouncebot: nowandnext
[05:57:10] <jouncebot>	 For the next 1 hour(s) and 2 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220522T0700)
[05:57:10] <jouncebot>	 In 1 hour(s) and 2 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220523T0700)
[05:57:26] <Amir1>	 cool. Going to deploy stuff
[05:58:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 5%: After reimage', diff saved to https://phabricator.wikimedia.org/P28289 and previous config saved to /var/cache/conftool/dbconfig/20220523-055815-root.json
[05:58:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:20] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:788385|TimedMediaHandler: Disabled the BetaFeature from wikis (T248418)]] (duration: 00m 51s)
[06:00:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:26] <stashbot>	 T248418: Roll out videojs as the only video/audio player on all Wikimedia wikis - https://phabricator.wikimedia.org/T248418
[06:00:36] <urbanecm>	 good morning Amir1, would you mind pinging me when you're done? 
[06:00:46] <Amir1>	 urbanecm: good morning, sure
[06:01:23] <urbanecm>	 thanks
[06:02:02] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:612350|TimedMediaHandler: Drop Beta Feature, no longer usable (T248418)]] (duration: 00m 52s)
[06:02:04] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1172.eqiad.wmnet with reason: host reimage
[06:02:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:02:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:03:59] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[06:04:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:04:13] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:612351|TimedMediaHandler: Don't read wmgTmhWebPlayer (T248418)]] (duration: 00m 50s)
[06:04:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:04:57] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1172.eqiad.wmnet with reason: host reimage
[06:04:58] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[06:04:59] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[06:05:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:05:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:05:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:06:59] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:612352|TimedMediaHandler: Drop pre-switch config, no longer read (T248418)]] (duration: 00m 54s)
[06:07:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:07:04] <stashbot>	 T248418: Roll out videojs as the only video/audio player on all Wikimedia wikis - https://phabricator.wikimedia.org/T248418
[06:07:27] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[06:07:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:10:43] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:793763|Turn on WRITE BOTH for templatelink migration in enwiki (T299421)]] (duration: 00m 51s)
[06:10:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:10:49] <stashbot>	 T299421: Turn on write both in production for templatelinks normalization - https://phabricator.wikimedia.org/T299421
[06:12:33] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[06:12:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:12:41] <icinga-wm>	 RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:13:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 10%: After reimage', diff saved to https://phabricator.wikimedia.org/P28290 and previous config saved to /var/cache/conftool/dbconfig/20220523-061319-root.json
[06:13:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:13:36] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[06:13:37] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[06:13:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:13:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:13:49] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[06:14:35] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[06:14:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:18:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance elastic1076-production-search-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[06:22:27] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1172.eqiad.wmnet with OS bullseye
[06:22:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:28:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 25%: After reimage', diff saved to https://phabricator.wikimedia.org/P28291 and previous config saved to /var/cache/conftool/dbconfig/20220523-062822-root.json
[06:28:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:34:46] <urbanecm>	 !log urbanecm@mwmaint1002:~$ foreachwikiindblist growthexperiments extensions/GrowthExperiments/maintenance/migrateMenteeOverviewFiltersToPresets.php --update # T304057
[06:34:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:34:51] <stashbot>	 T304057: Migrate growthexperiments-mentee-overview-filters to growthexperiments-mentee-overview-presets - https://phabricator.wikimedia.org/T304057
[06:35:22] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:791605|Remove unused OggThumbLocation config variable (T308191)]] (duration: 00m 51s)
[06:35:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:35:26] <stashbot>	 T308191: Remove oggThumb from TMH - https://phabricator.wikimedia.org/T308191
[06:36:33] <Amir1>	 urbanecm: I'm done finally
[06:36:44] <urbanecm>	 thanks
[06:37:53] <urbanecm>	 mine should be fairly quick
[06:38:40] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized private/PrivateSettings.php: Update T250887 mitigations (duration: 00m 52s)
[06:38:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:39:32] * urbanecm done
[06:39:45] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[06:39:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:40:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[06:40:45] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[06:40:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:40:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:43:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 50%: After reimage', diff saved to https://phabricator.wikimedia.org/P28292 and previous config saved to /var/cache/conftool/dbconfig/20220523-064326-root.json
[06:43:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:44:46] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[06:44:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:48:01] <icinga-wm>	 RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[06:48:49] <icinga-wm>	 RECOVERY - Check that envoy is running on idp-test2002 is OK: OK - envoyproxy.service is active https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[06:49:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[06:49:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:50:41] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[06:50:43] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[06:50:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:50:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:39] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[06:51:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:58:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 75%: After reimage', diff saved to https://phabricator.wikimedia.org/P28293 and previous config saved to /var/cache/conftool/dbconfig/20220523-065830-root.json
[06:58:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:58:38] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance
[06:58:39] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance
[06:58:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:58:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:04] <jouncebot>	 Amir1 and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220523T0700).
[07:00:04] <jouncebot>	 kart_ and DannyS712: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:52] * kart_ is here
[07:01:17] * DannyS712 is here
[07:01:42] <kart_>	 I'll start with +2 on the wmf.12 patch and CI will take a few minutes - meanwhile I'll deploy the config patch.
[07:02:17] <DannyS712>	 can I add a 5th patch (makes it 7 total) - https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/796353 ?
[07:02:37] <DannyS712>	 (all of my patches are phpcs cleanup and shouldn't actually change anything)
[07:02:50] <kart_>	 DannyS712: sure. Go ahead.
[07:03:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T298555)', diff saved to https://phabricator.wikimedia.org/P28294 and previous config saved to /var/cache/conftool/dbconfig/20220523-070314-ladsgroup.json
[07:03:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:03:20] <stashbot>	 T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[07:06:26] <hashar>	 good morning
[07:06:29] <kart_>	 Looks like mw-config patch merged notifications no longer appear here?
[07:08:48] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Razzi out of all services on: 562 hosts
[07:08:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:09:06] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Razzi out of all services on: 562 hosts
[07:09:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:09:18] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Razzi out of all services on: 1227 hosts
[07:09:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:09:34] <kart_>	 Deploying config patch..
[07:09:51] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Razzi out of all services on: 1227 hosts
[07:09:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:09:57] <logmsgbot>	 !log kartik@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:793444|Enable ContentTranslation as default for cs, el, he, ko and tr WPs (T298239 T304853 T304854 T304855 T304863)]] (duration: 00m 50s)
[07:10:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:10:06] <stashbot>	 T304854: Enable Content and Section Translation for Greek Wikipedia - https://phabricator.wikimedia.org/T304854
[07:10:07] <stashbot>	 T304855: Enable Content and Section Translation for Czech Wikipedia - https://phabricator.wikimedia.org/T304855
[07:10:08] <stashbot>	 T304863: Enable Content and Section Translation for Hebrew Wikipedia - https://phabricator.wikimedia.org/T304863
[07:10:08] <stashbot>	 T304853: Enable Content and Section Translation for Turkish Wikipedia - https://phabricator.wikimedia.org/T304853
[07:10:08] <stashbot>	 T298239: Enable Content and Section Translation for Korean Wikipedia - https://phabricator.wikimedia.org/T298239
[07:11:53] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:11:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:06] <kart_>	 DannyS712: Waiting for CI for wmf.12 patch now..
[07:12:33] <kart_>	 Seems 8 minutes..
[07:13:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance elastic1076-production-search-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[07:13:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 100%: After reimage', diff saved to https://phabricator.wikimedia.org/P28295 and previous config saved to /var/cache/conftool/dbconfig/20220523-071334-root.json
[07:13:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:14:34] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:14:35] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:14:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:14:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:15:43] <DannyS712>	 kart_ okay. Can I add a 6th patch for me / 8th overall? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/796354
[07:16:37] <DannyS712>	 (I know normally it's a max of 6 patches but since these are no-ops I thought it might be okay)
[07:17:10] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:17:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T298555)', diff saved to https://phabricator.wikimedia.org/P28296 and previous config saved to /var/cache/conftool/dbconfig/20220523-071728-ladsgroup.json
[07:17:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:31] <kart_>	 DannyS712: As long as it can fit into the window and they're no-ops :)
[07:17:32] <stashbot>	 T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[07:18:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P28297 and previous config saved to /var/cache/conftool/dbconfig/20220523-071819-ladsgroup.json
[07:18:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:39] <DannyS712>	 okay, then I'll keep adding patches and we'll see what we get to. I think your wmf.12 patch merged
[07:22:16] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:22:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:12] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:23:13] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:23:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:24] <kart_>	 Testing my patch on mwdebug1001 
[07:24:12] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:24:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:24:59] <kart_>	 Deploying now..
[07:25:40] <logmsgbot>	 !log kartik@deploy1002 Synchronized php-1.39.0-wmf.12/extensions/ContentTranslation/modules/base/mw.cx.SiteMapper.js: Backport: [[gerrit:796351|Sitemapper: Fix the configuration override (T308802)]] (duration: 00m 51s)
[07:25:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:46] <stashbot>	 T308802: Content Translation redirects back to "Start translation" when dashboard is loaded from contribution menu - https://phabricator.wikimedia.org/T308802
[07:26:24] <kart_>	 DannyS712: I'm done.
[07:27:34] <DannyS712>	 okay. Just realized - you were self-deploying your patches, but I can't do that for my own patches because I don't have deployment rights
[07:28:04] <kart_>	 Oh, I thought you were doing it yourself :/
[07:28:14] <kart_>	 Is urbanecm around?
[07:28:40] <urbanecm>	 Yes. What's up?
[07:28:58] <kart_>	 urbanecm: DannyS712's patches need help.
[07:29:14] <DannyS712>	 help = deployment
[07:29:35] <kart_>	 Oh yeah.
[07:29:46] <kart_>	 I need to go to lunch in a few minutes.
[07:30:03] <urbanecm>	 well, let's have a look then
[07:30:04] <urbanecm>	 jouncebot: now
[07:30:04] <jouncebot>	 For the next 0 hour(s) and 29 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220523T0700)
[07:30:16] <DannyS712>	 all no-op phpcs cleanup
[07:30:36] <kart_>	 DannyS712: all patches have a CI failure in one of the checks?
[07:31:03] <urbanecm>	 kart_: that's because they're no-ops. 
[07:31:30] <urbanecm>	 operations-mw-config-php72-composer-diffConfig-docker expects a change to be made by a config change, which is a reasonable assumption, but in this case, it's okay there is no change :)
[07:31:34] <urbanecm>	 where is wikibugs btw?
[07:31:34] <kart_>	 Haven't looked at the code, sorry :/
[07:31:38] <urbanecm>	 no problem :)
[07:32:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P28298 and previous config saved to /var/cache/conftool/dbconfig/20220523-073233-ladsgroup.json
[07:32:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P28299 and previous config saved to /var/cache/conftool/dbconfig/20220523-073324-ladsgroup.json
[07:33:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:34:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:33] <icinga-wm>	 PROBLEM - Memcached on idp-test1002 is CRITICAL: connect to address 208.80.154.72 and port 11000: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[07:35:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:35:20] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:35:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:45] <DannyS712>	 urbanecm I have more than the 4 patches that would meet the 6 maximum normally imposed for backport windows, but since these are all no-ops would you be willing to deploy more than the 4?
[07:36:02] <urbanecm>	 DannyS712: i'm reviewing them all, we should be able to do them
[07:36:17] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:36:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:25] <urbanecm>	 it's easier to deploy obvious no-ops like those patches :)
[07:38:01] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized private/readme.php: 7a8d8a06: phpcs: move DisallowYodaConditions exclusion inline (duration: 00m 49s)
[07:38:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:15] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized multiversion/MWConfigCacheGenerator.php: e6fb9266: phpcs: enable FunctionComment.MissingDocumentationPrivate (duration: 01m 30s)
[07:40:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:20] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:41:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:44] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized multiversion/MWConfigCacheGenerator.php: 8f8b04e0: phpcs: enable PropertyDocumentation.WrongStyle (duration: 00m 49s)
[07:41:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:21] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:42:22] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:42:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:33] <DannyS712>	 there are a few later patches that are not on the wikitech page but still in the same relation chain, I'll update wikitech with what is actually getting deployed at the end
[07:42:35] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized w/fatal-error.php: 8f8b04e0: phpcs: enable PropertyDocumentation.WrongStyle (duration: 00m 50s)
[07:42:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:43:20] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:43:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:43:43] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized w/fatal-error.php: 7c28808: phpcs: enable and suppress DuplicateClassName.Found (duration: 00m 48s)
[07:43:45] <urbanecm>	 DannyS712: does that mean you want me to review&deploy more patches than what's at the wikitech page?
[07:43:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:09] <DannyS712>	 urbanecm if you're willing, yes
[07:44:25] <urbanecm>	 DannyS712: in that case, please list them in the calendar :)
[07:44:44] <DannyS712>	 okay, it's https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/796356/2 and the follow-up to that and 1 more I'll create in a second
[07:46:31] <DannyS712>	 third one is https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/796358
[07:47:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P28300 and previous config saved to /var/cache/conftool/dbconfig/20220523-074739-ladsgroup.json
[07:47:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:19] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized src/: 0e012139: phpcs: enable PropertyDocumentation.MissingDocumentationPrivate (duration: 00m 50s)
[07:48:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:28] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:48:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T298555)', diff saved to https://phabricator.wikimedia.org/P28301 and previous config saved to /var/cache/conftool/dbconfig/20220523-074829-ladsgroup.json
[07:48:31] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[07:48:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:33] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[07:48:33] <DannyS712>	 also 4th https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/796359
[07:48:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:37] <stashbot>	 T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[07:48:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T298555)', diff saved to https://phabricator.wikimedia.org/P28302 and previous config saved to /var/cache/conftool/dbconfig/20220523-074837-ladsgroup.json
[07:48:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:09] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized w/fatal-error.php: 0e012139: phpcs: enable PropertyDocumentation.MissingDocumentationPrivate (duration: 00m 49s)
[07:49:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:27] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:49:28] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:49:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:31] <RhinosF1>	 urbanecm: wikibugs is https://phabricator.wikimedia.org/T308995
[07:49:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:53] <DannyS712>	 deployment calendar updated
[07:49:55] <logmsgbot>	 !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host htmldumper1001.eqiad.wmnet
[07:49:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:00] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized multiversion/MWConfigCacheGenerator.php: 0e012139: phpcs: enable PropertyDocumentation.MissingDocumentationPrivate (duration: 00m 50s)
[07:50:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:20] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:50:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:51:34] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized w/fatal-error.php: a888904: phpcs: enable and suppress ClassMatchesFilename.NotMatch (duration: 00m 49s)
[07:51:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:24] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/CommonSettings.php: a888904: phpcs: enable and suppress ClassMatchesFilename.NotMatch (duration: 00m 49s)
[07:52:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:25] <logmsgbot>	 !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host htmldumper1001.eqiad.wmnet
[07:53:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:34] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/CommonSettings.php: 86d08457: phpcs: move ForbiddenFunctions.extract exclusion inline (duration: 00m 50s)
[07:53:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:21] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized docroot/noc/conf/activeMWVersions.php: e1df8fabc: phpcs: move ForbiddenFunctions.exec exclusion inline (duration: 00m 50s)
[07:55:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:26] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:55:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:04] <urbanecm>	 DannyS712: and that should be it :)
[07:56:25] <DannyS712>	 do you have time for one more? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/796360/4
[07:56:26] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:56:27] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:56:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:23] <urbanecm>	 DannyS712: we've less than three minutes, that's not enough unfortunately.
[07:57:32] <DannyS712>	 okay, then next time
[07:57:36] <urbanecm>	 yup :)
[07:57:51] <DannyS712>	 still, I got 11 patches merged in record time
[07:58:22] <DannyS712>	 can I add this one to the UTC late backport window today that you are deploying? I might not be around then but it should still be a no-op
[07:59:02] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:59:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:09] <urbanecm>	 DannyS712: i can't guarantee it'll be me actually doing the deployment though
[07:59:32] <DannyS712>	 okay, I'll list it there and hope for the best, I might be able to make it
[08:00:27] <DannyS712>	 thanks for reviewing and deploying! :)
[08:02:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T298555)', diff saved to https://phabricator.wikimedia.org/P28303 and previous config saved to /var/cache/conftool/dbconfig/20220523-080244-ladsgroup.json
[08:02:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:02:50] <stashbot>	 T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[08:04:05] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[08:04:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:03] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[08:05:04] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[08:05:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:05] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[08:06:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:12:03] <taavi>	 !log fixing renames of 44 accounts T308895
[08:12:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:12:10] <stashbot>	 T308895: GlobalRename not renaming some accounts - https://phabricator.wikimedia.org/T308895
[08:14:03] <icinga-wm>	 RECOVERY - Check systemd state on dumpsdata1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:14:29] <DannyS712>	 can confirm that there are rename logs showing up on enwikiquote
[08:19:51] <icinga-wm>	 PROBLEM - Check systemd state on dumpsdata1002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_rasdaemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:26:16] <icinga-wm>	 RECOVERY - Disk space on gitlab1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=gitlab1001&var-datasource=eqiad+prometheus/ops
[08:37:30] <icinga-wm>	 PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:44:22] <icinga-wm>	 PROBLEM - SSH on analytics1061.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:01:07] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[09:01:08] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[09:01:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:17] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5003.eqsin.wmnet
[09:13:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:56] <icinga-wm>	 PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:18:51] <elukey>	 .11
[09:18:54] <elukey>	 uff :)
[09:22:54] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5003.eqsin.wmnet
[09:22:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:16] <icinga-wm>	 PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:24:59] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti5003.eqsin.wmnet to ganeti01.svc.eqsin.wmnet
[09:25:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:46] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti5003.eqsin.wmnet to ganeti01.svc.eqsin.wmnet
[09:25:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:57] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.reimage for host gitlab1003.wikimedia.org with OS bullseye
[09:26:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:35:45] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:36:40] <icinga-wm>	 PROBLEM - SSH on bast3005 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:37:46] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab1003.wikimedia.org with reason: host reimage
[09:37:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:31] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1022.eqiad.wmnet
[09:38:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:42] <icinga-wm>	 RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:38:56] <icinga-wm>	 RECOVERY - SSH on bast3005 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:40:35] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab1003.wikimedia.org with reason: host reimage
[09:40:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:45] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:42:12] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1022.eqiad.wmnet
[09:42:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:33] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.dns.netbox
[09:44:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:45:02] <jinxer-wm>	 (MXQueueNoMetrics) firing: Queue length metrics not found - https://wikitech.wikimedia.org/wiki/Exim - https://grafana.wikimedia.org/d/000000451/mail - https://alerts.wikimedia.org/?q=alertname%3DMXQueueNoMetrics
[09:45:46] <icinga-wm>	 RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:49:24] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:49:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:02] <jinxer-wm>	 (MXQueueNoMetrics) firing: (2) Queue length metrics not found - https://wikitech.wikimedia.org/wiki/Exim - https://grafana.wikimedia.org/d/000000451/mail - https://alerts.wikimedia.org/?q=alertname%3DMXQueueNoMetrics
[09:50:45] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:51:25] <godog>	 hah! the mxqueuenometrics makes sense, I'll fix it
[09:54:32] <moritzm>	 !log failover ganeti master in eqsin to ganeti5003 T308211
[09:54:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:54:38] <stashbot>	 T308211: Upgrade ganeti/eqsin to Bullseye - https://phabricator.wikimedia.org/T308211
[09:55:02] <jinxer-wm>	 (MXQueueNoMetrics) firing: (8) Queue length metrics not found - https://wikitech.wikimedia.org/wiki/Exim - https://grafana.wikimedia.org/d/000000451/mail - https://alerts.wikimedia.org/?q=alertname%3DMXQueueNoMetrics
[09:55:45] <jinxer-wm>	 (JobUnavailable) resolved: (4) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:55:48] <moritzm>	 !log drain ganeti5001 T308211
[09:55:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:36] <jbond>	 godog: happy for me to merge your cr
[09:56:45] <godog>	 jbond: oops! yes please
[09:56:50] <jbond>	 np doing
[09:57:06] * jbond done
[09:59:44] <icinga-wm>	 RECOVERY - Disk space on gitlab1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=gitlab1003&var-datasource=eqiad+prometheus/ops
[10:00:02] <jinxer-wm>	 (MXQueueNoMetrics) firing: (6) Queue length metrics not found - https://wikitech.wikimedia.org/wiki/Exim - https://grafana.wikimedia.org/d/000000451/mail - https://alerts.wikimedia.org/?q=alertname%3DMXQueueNoMetrics
[10:00:04] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab1003.wikimedia.org with OS bullseye
[10:00:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:01:37] <icinga-wm>	 PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:02:01] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1023.eqiad.wmnet
[10:02:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:26] <moritzm>	 ^ this includes a restart of kubetcd1005 since not on DRBD
[10:04:35] <icinga-wm>	 PROBLEM - Host kubetcd1005 is DOWN: PING CRITICAL - Packet loss = 100%
[10:05:34] <icinga-wm>	 PROBLEM - ganeti-wconfd running on ganeti5001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[10:05:45] <icinga-wm>	 RECOVERY - Host kubetcd1005 is UP: PING OK - Packet loss = 0%, RTA = 0.60 ms
[10:07:11] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1023.eqiad.wmnet
[10:07:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T298555)', diff saved to https://phabricator.wikimedia.org/P28306 and previous config saved to /var/cache/conftool/dbconfig/20220523-100809-ladsgroup.json
[10:08:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:15] <stashbot>	 T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[10:09:12] <hnowlan>	 !log starting reboot of eqiad maps hosts for updates 
[10:09:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:09:16] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps1005.eqiad.wmnet
[10:09:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:10:02] <jinxer-wm>	 (MXQueueNoMetrics) firing: Queue length metrics not found - https://wikitech.wikimedia.org/wiki/Exim - https://grafana.wikimedia.org/d/000000451/mail - https://alerts.wikimedia.org/?q=alertname%3DMXQueueNoMetrics
[10:10:22] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on 6 hosts with reason: postgres config change
[10:10:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:10:27] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on 6 hosts with reason: postgres config change
[10:10:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:12:16] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1119.eqiad.wmnet with reason: Maintenance
[10:12:18] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1119.eqiad.wmnet with reason: Maintenance
[10:12:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:12:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:12:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1119 (T303171)', diff saved to https://phabricator.wikimedia.org/P28307 and previous config saved to /var/cache/conftool/dbconfig/20220523-101222-ladsgroup.json
[10:12:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:12:28] <stashbot>	 T303171: Upgrade s1 to Bullseye - https://phabricator.wikimedia.org/T303171
[10:12:58] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (4) rsyslog on kubernetes1014:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[10:15:02] <jinxer-wm>	 (MXQueueNoMetrics) firing: (2) Queue length metrics not found - https://wikitech.wikimedia.org/wiki/Exim - https://grafana.wikimedia.org/d/000000451/mail - https://alerts.wikimedia.org/?q=alertname%3DMXQueueNoMetrics
[10:15:43] <icinga-wm>	 RECOVERY - Check systemd state on maps1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:17:12] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps1005.eqiad.wmnet
[10:17:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:23] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on maps1006.eqiad.wmnet with reason: security update
[10:17:24] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on maps1006.eqiad.wmnet with reason: security update
[10:17:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:50] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps1006.eqiad.wmnet
[10:18:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1025.eqiad.wmnet
[10:18:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:23:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P28308 and previous config saved to /var/cache/conftool/dbconfig/20220523-102314-ladsgroup.json
[10:23:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:24:32] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db1119.eqiad.wmnet with OS bullseye
[10:24:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:25:09] <logmsgbot>	 !log btullis@deploy1002 Started deploy [analytics/superset/deploy@09094de]: (no justification provided)
[10:25:12] <logmsgbot>	 !log btullis@deploy1002 Finished deploy [analytics/superset/deploy@09094de]: (no justification provided) (duration: 00m 03s)
[10:25:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:25:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:05] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1119.eqiad.wmnet with reason: host reimage
[10:33:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:35:53] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1119.eqiad.wmnet with reason: host reimage
[10:35:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:37:49] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps1006.eqiad.wmnet
[10:37:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:38:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P28309 and previous config saved to /var/cache/conftool/dbconfig/20220523-103819-ladsgroup.json
[10:38:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:40:12] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on maps1007.eqiad.wmnet with reason: security update
[10:40:13] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on maps1007.eqiad.wmnet with reason: security update
[10:40:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:40:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:40:56] <icinga-wm>	 RECOVERY - Check systemd state on maps1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:42:48] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:44:40] <icinga-wm>	 RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:51:55] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1119.eqiad.wmnet with OS bullseye
[10:51:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:53:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T298555)', diff saved to https://phabricator.wikimedia.org/P28310 and previous config saved to /var/cache/conftool/dbconfig/20220523-105324-ladsgroup.json
[10:53:26] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1105.eqiad.wmnet with reason: Maintenance
[10:53:28] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1105.eqiad.wmnet with reason: Maintenance
[10:53:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:53:31] <stashbot>	 T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[10:53:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T298555)', diff saved to https://phabricator.wikimedia.org/P28311 and previous config saved to /var/cache/conftool/dbconfig/20220523-105332-ladsgroup.json
[10:53:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:53:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:53:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:58:41] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host ganeti1025.eqiad.wmnet
[10:58:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:00:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T303171)', diff saved to https://phabricator.wikimedia.org/P28312 and previous config saved to /var/cache/conftool/dbconfig/20220523-110043-ladsgroup.json
[11:00:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:00:49] <stashbot>	 T303171: Upgrade s1 to Bullseye - https://phabricator.wikimedia.org/T303171
[11:01:21] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1026.eqiad.wmnet
[11:01:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:10:58] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps1007.eqiad.wmnet
[11:11:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:11:08] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps1008.eqiad.wmnet
[11:11:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:11:17] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on maps1008.eqiad.wmnet with reason: security update
[11:11:19] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on maps1008.eqiad.wmnet with reason: security update
[11:11:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:11:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:12:07] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:15:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P28313 and previous config saved to /var/cache/conftool/dbconfig/20220523-111548-ladsgroup.json
[11:15:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:18:08] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on maps1008.eqiad.wmnet with reason: security update
[11:18:09] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on maps1008.eqiad.wmnet with reason: security update
[11:18:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:18:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:19:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1177 to clone db1172', diff saved to https://phabricator.wikimedia.org/P28314 and previous config saved to /var/cache/conftool/dbconfig/20220523-111902-marostegui.json
[11:19:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:21:29] <icinga-wm>	 RECOVERY - Check systemd state on maps1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:25:01] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps1008.eqiad.wmnet
[11:25:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:25:13] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on maps1009.eqiad.wmnet with reason: security update
[11:25:14] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on maps1009.eqiad.wmnet with reason: security update
[11:25:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:25:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:30:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P28316 and previous config saved to /var/cache/conftool/dbconfig/20220523-113053-ladsgroup.json
[11:30:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:31:03] <icinga-wm>	 PROBLEM - Maps - OSM synchronization lag - eqiad on alert1001 is CRITICAL: 2.853e+06 ge 2.592e+05 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&viewPanel=11
[11:38:19] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on maps1010.eqiad.wmnet with reason: security update
[11:38:20] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on maps1010.eqiad.wmnet with reason: security update
[11:38:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:38:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:38:26] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps1010.eqiad.wmnet
[11:38:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:41:10] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host ganeti1026.eqiad.wmnet
[11:41:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:43:35] <icinga-wm>	 RECOVERY - Check systemd state on maps1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:45:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T303171)', diff saved to https://phabricator.wikimedia.org/P28317 and previous config saved to /var/cache/conftool/dbconfig/20220523-114559-ladsgroup.json
[11:46:03] <icinga-wm>	 RECOVERY - SSH on analytics1061.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:46:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:46:05] <stashbot>	 T303171: Upgrade s1 to Bullseye - https://phabricator.wikimedia.org/T303171
[11:47:13] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host maps2005.codfw.wmnet
[11:47:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:50:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:51:56] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1134.eqiad.wmnet with reason: Maintenance
[11:51:57] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1134.eqiad.wmnet with reason: Maintenance
[11:52:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:52:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T303171)', diff saved to https://phabricator.wikimedia.org/P28318 and previous config saved to /var/cache/conftool/dbconfig/20220523-115202-ladsgroup.json
[11:52:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:52:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:52:08] <stashbot>	 T303171: Upgrade s1 to Bullseye - https://phabricator.wikimedia.org/T303171
[11:52:51] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps2005.codfw.wmnet
[11:52:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:55:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:56:05] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host maps2006.codfw.wmnet
[11:56:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:56:43] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db1134.eqiad.wmnet with OS bullseye
[11:56:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:01:10] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps2006.codfw.wmnet
[12:01:10] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1027.eqiad.wmnet
[12:01:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:01:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:02:08] <icinga-wm>	 RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:03:40] <icinga-wm>	 PROBLEM - Host kubestagetcd1006 is DOWN: PING CRITICAL - Packet loss = 100%
[12:05:34] <icinga-wm>	 RECOVERY - Host kubestagetcd1006 is UP: PING OK - Packet loss = 0%, RTA = 0.53 ms
[12:06:43] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1134.eqiad.wmnet with reason: host reimage
[12:06:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:09:34] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1134.eqiad.wmnet with reason: host reimage
[12:09:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:11:58] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[12:16:53] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[12:16:55] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[12:16:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:17:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T298555)', diff saved to https://phabricator.wikimedia.org/P28320 and previous config saved to /var/cache/conftool/dbconfig/20220523-121659-ladsgroup.json
[12:17:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:17:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:17:07] <stashbot>	 T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[12:18:38] <icinga-wm>	 RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:20:30] <icinga-wm>	 PROBLEM - Check systemd state on ms-fe1009 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:20:33] <logmsgbot>	 !log aqu@deploy1002 Started deploy [airflow-dags/analytics_test@c9b397c]: T305843_migrate_clickstream_job_from_oozie_to_airflow [airflow-dags/analytics_test@c9b397c]
[12:20:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:20:42] <logmsgbot>	 !log aqu@deploy1002 Finished deploy [airflow-dags/analytics_test@c9b397c]: T305843_migrate_clickstream_job_from_oozie_to_airflow [airflow-dags/analytics_test@c9b397c] (duration: 00m 08s)
[12:20:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:23:48] <logmsgbot>	 !log aqu@deploy1002 Started deploy [airflow-dags/analytics@c9b397c]: T305843_migrate_clickstream_job_from_oozie_to_airflow [airflow-dags/analytics@c9b397c]
[12:23:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:23:56] <logmsgbot>	 !log aqu@deploy1002 Finished deploy [airflow-dags/analytics@c9b397c]: T305843_migrate_clickstream_job_from_oozie_to_airflow [airflow-dags/analytics@c9b397c] (duration: 00m 08s)
[12:23:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:25:29] <icinga-wm>	 RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:25:57] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1134.eqiad.wmnet with OS bullseye
[12:26:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:28:38] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1027.eqiad.wmnet
[12:28:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:29:11] <icinga-wm>	 RECOVERY - Check systemd state on ms-fe1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:31:23] <icinga-wm>	 PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:39:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T303171)', diff saved to https://phabricator.wikimedia.org/P28321 and previous config saved to /var/cache/conftool/dbconfig/20220523-123944-ladsgroup.json
[12:39:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:39:49] <stashbot>	 T303171: Upgrade s1 to Bullseye - https://phabricator.wikimedia.org/T303171
[12:51:04] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on ganeti5001.eqsin.wmnet with reason: Remove from cluster for firmware update and eventual reimage
[12:51:06] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on ganeti5001.eqsin.wmnet with reason: Remove from cluster for firmware update and eventual reimage
[12:51:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:51:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:54:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P28322 and previous config saved to /var/cache/conftool/dbconfig/20220523-125449-ladsgroup.json
[12:54:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:00:04] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, and awight: #bothumor My software never has bugs. It just develops random features. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220523T1300).
[13:00:05] <jouncebot>	 James_F, koi, and tgr: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:14] <James_F>	 My patch has already been deployed.
[13:00:21] <tgr>	 o/
[13:00:27] <James_F>	 Thanks, Amir1. :-)
[13:00:34] <koi>	 hi there
[13:00:53] <tgr>	 I guess I should do the deploys then
[13:04:09] <tgr>	 seems like wikibugs bot is on vacation
[13:07:16] <urbanecm>	 tgr: if you can, would be great :)
[13:07:19] <urbanecm>	 wikibugs is T308995
[13:07:19] <stashbot>	 T308995: wikibugs not show phab/gerrit comments on IRC - https://phabricator.wikimedia.org/T308995
[13:09:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P28323 and previous config saved to /var/cache/conftool/dbconfig/20220523-130954-ladsgroup.json
[13:09:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:12:32] <taavi>	 urbanecm: looks like the same kind of mystery issue that we saw in T291129 and T304180
[13:12:33] <stashbot>	 T304180: Wikibugs: Quit due to excess flood - https://phabricator.wikimedia.org/T304180
[13:12:33] <stashbot>	 T291129: wikibugs failing to connect when run on exec hosts - https://phabricator.wikimedia.org/T291129
[13:12:40] <urbanecm>	 possible :)
[13:13:51] <koi>	 thanks tgr, no need to test this patch
[13:15:05] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:15:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:15:18] <logmsgbot>	 !log tgr@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:794000|Update IP addresses for Wiki Education Dashboard exemptions (T308702)]] (duration: 00m 52s)
[13:15:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:15:22] <stashbot>	 T308702: Update IP addresses for Wiki Education Dashboard exemptions to rate-limiting and global block - https://phabricator.wikimedia.org/T308702
[13:16:07] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:16:08] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:16:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:16:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:16:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T298555)', diff saved to https://phabricator.wikimedia.org/P28324 and previous config saved to /var/cache/conftool/dbconfig/20220523-131641-ladsgroup.json
[13:16:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:16:46] <stashbot>	 T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[13:17:01] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:17:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:17:50] <tgr>	 koi: IP exemption patch is live, zhwiki RC patrol patch is on mwdebug1001
[13:18:00] <koi>	 could a sysadmin have a look at T308976? I could not patrol at zhwiki so couldn't check..
[13:18:02] <stashbot>	 T308976: Enable Recent Changes Patrol for Chinese Wikipedia - https://phabricator.wikimedia.org/T308976
[13:21:50] <koi>	 ping taavi and urbanecm for help ^
[13:22:05] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:22:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:22:16] <taavi>	 ?
[13:22:43] <tgr>	 what would I look for exactly?
[13:22:54] <koi>	 need to check if every new edit has a "mark for patrol" link
[13:23:07] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:23:08] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:23:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:24:07] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:24:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:24:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T298555)', diff saved to https://phabricator.wikimedia.org/P28325 and previous config saved to /var/cache/conftool/dbconfig/20220523-132438-ladsgroup.json
[13:24:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:24:44] <stashbot>	 T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[13:24:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T303171)', diff saved to https://phabricator.wikimedia.org/P28326 and previous config saved to /var/cache/conftool/dbconfig/20220523-132459-ladsgroup.json
[13:25:04] <koi>	 like for this edit, is there a link to mark for patrol at the top (near the timestamp)
[13:25:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:25:05] <koi>	 https://zh.wikipedia.org/w/index.php?title=%E4%B8%89%E9%97%96%E5%B0%91%E6%9E%97&type=revision&diff=71783956&oldid=71783920&diffmode=source&uselang=en
[13:25:05] <stashbot>	 T303171: Upgrade s1 to Bullseye - https://phabricator.wikimedia.org/T303171
[13:25:06] <tgr>	 nothing like that jumps out
[13:25:30] <tgr>	 I can verify via shell.php that $wgUseRCPatrol is true
[13:26:06] <taavi>	 +sysadmin doesn't include 'patrol' or 'patrolmarks' needed to see those, only 'autopatrol'
[13:27:02] <koi>	 could you grant yourself the "patroller" right to check it
[13:27:24] <James_F>	 No, +sysadmins should never grant themselves rights except in emergencies.
[13:27:29] <tgr>	 staff does have patrolmarks (though not patrol)
[13:28:20] <koi>	 well, anyway let's sync; it's not a big problem though
[13:28:29] <tgr>	 ok
[13:29:04] <tgr>	 oh, duh, I had enabled xdebug instead of x-wikimedia-debug
[13:29:11] <tgr>	 ok, I can see the patrol marks
[13:29:52] <koi>	 thanks!
[13:30:29] <logmsgbot>	 !log tgr@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:795526|zhwiki: Enable RCPatrol (T308976)]] (duration: 00m 51s)
[13:30:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:34] <stashbot>	 T308976: Enable Recent Changes Patrol for Chinese Wikipedia - https://phabricator.wikimedia.org/T308976
[13:31:40] <icinga-wm>	 RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:31:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P28327 and previous config saved to /var/cache/conftool/dbconfig/20220523-133146-ladsgroup.json
[13:31:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:32:01] <tgr>	 koi: patrol is live, itwiki new protection level is on mwdebug1001
[13:32:08] <koi>	 looking
[13:32:22] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1135.eqiad.wmnet with reason: Maintenance
[13:32:24] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1135.eqiad.wmnet with reason: Maintenance
[13:32:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:32:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T303171)', diff saved to https://phabricator.wikimedia.org/P28328 and previous config saved to /var/cache/conftool/dbconfig/20220523-133228-ladsgroup.json
[13:32:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:32:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:32:34] <stashbot>	 T303171: Upgrade s1 to Bullseye - https://phabricator.wikimedia.org/T303171
[13:33:51] <koi>	 tgr, LGTM
[13:34:18] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:34:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:35:15] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:35:15] <logmsgbot>	 !log tgr@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:794590|itwiki: Add "editautopatrolprotected" protection level (T308917)]] (duration: 00m 52s)
[13:35:16] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:35:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:35:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:35:22] <stashbot>	 T308917: Add "editautopatrolprotected" protection level to itwiki - https://phabricator.wikimedia.org/T308917
[13:35:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:36:14] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:36:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:37:21] <tgr>	 koi: protection level is live, rowiki namespace names are on mwdebug1001
[13:39:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P28329 and previous config saved to /var/cache/conftool/dbconfig/20220523-133944-ladsgroup.json
[13:39:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:39:51] <koi>	 tgr: LGTM
[13:40:06] <koi>	 please also run namespaceDupes.php
[13:40:59] <logmsgbot>	 !log tgr@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:793999|rowiki: Use Romanian canonical name (T127607)]] (duration: 00m 50s)
[13:41:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:41:05] <stashbot>	 T127607: Fix canonical namespaces for rowiki - https://phabricator.wikimedia.org/T127607
[13:41:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:41:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:42:11] <zabe>	 Is it possible to set configs for specific shards? It should be since there is a .dblist file for each shard, right?
[13:42:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:42:20] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:42:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:42:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:43:03] <tgr>	 koi: doesn't find anything to fix
[13:43:21] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:43:21] <koi>	 thanks anyway
[13:43:27] <tgr>	 since the definitions were just swapped between canonical and alias, I guess that's to be expected
[13:43:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:43:54] <taavi>	 zabe: should be, yes
[13:44:01] <taavi>	 although I'm quite curious about your use case for that
[13:44:54] <zabe>	 https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/797294
[13:45:36] <logmsgbot>	 !log tgr@deploy1002 Synchronized php-1.39.0-wmf.12/extensions/OAuth/src/Frontend/SpecialPages/SpecialMWOAuthConsumerRegistration.php: Backport: [[gerrit:793795|Remove 'required' from callbackIsPrefix (T308880)]] (duration: 00m 50s)
[13:45:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:42] <stashbot>	 T308880: "callback is prefix" checkbox should not be required during registration - https://phabricator.wikimedia.org/T308880
[13:46:05] <tgr>	 !log EU mid-day deploys done
[13:46:06] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db1135.eqiad.wmnet with OS bullseye
[13:46:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:46:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:46:19] <tgr>	 I'll test the last one in production, it's a trivial change
[13:46:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P28330 and previous config saved to /var/cache/conftool/dbconfig/20220523-134651-ladsgroup.json
[13:46:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:46:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1177 (re)pooling @ 1%: After recloning db1172', diff saved to https://phabricator.wikimedia.org/P28331 and previous config saved to /var/cache/conftool/dbconfig/20220523-134657-root.json
[13:47:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:47:24] <taavi>	 zabe: hmm, the diffConfig job doesn't look as expected
[13:48:11] <zabe>	 hmm, yeah
[13:49:40] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host maps2007.codfw.wmnet
[13:49:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:52:28] <PeterBowman>	 hello, first time here, I'm going to list https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/789613/ in the calendar for today's late window (T307683)
[13:52:28] <stashbot>	 T307683: Add localized wordmark for plwiktionary - https://phabricator.wikimedia.org/T307683
[13:54:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P28332 and previous config saved to /var/cache/conftool/dbconfig/20220523-135449-ladsgroup.json
[13:54:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:27] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps2007.codfw.wmnet
[13:55:28] <James_F>	 PeterBowman: Welcome! You should crush the SVG file first, please.
[13:55:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:39] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1135.eqiad.wmnet with reason: host reimage
[13:55:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:56:22] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1028.eqiad.wmnet
[13:56:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:57:08] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host maps2008.codfw.wmnet
[13:57:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:59:01] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1135.eqiad.wmnet with reason: host reimage
[13:59:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:00:52] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1028.eqiad.wmnet
[14:00:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:01:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T298555)', diff saved to https://phabricator.wikimedia.org/P28334 and previous config saved to /var/cache/conftool/dbconfig/20220523-140156-ladsgroup.json
[14:01:58] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[14:01:59] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[14:02:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:02:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1177 (re)pooling @ 5%: After recloning db1172', diff saved to https://phabricator.wikimedia.org/P28335 and previous config saved to /var/cache/conftool/dbconfig/20220523-140201-root.json
[14:02:02] <stashbot>	 T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[14:02:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:02:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:02:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:02:44] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps2008.codfw.wmnet
[14:02:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:31] <logmsgbot>	 !log aqu@deploy1002 Started deploy [airflow-dags/analytics_test@95d0f86]: T295072 Spark 3 from Airflow venv pyspark [airflow-dags/analytics_test@95d0f86]
[14:08:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:37] <stashbot>	 T295072: Install spark3 in analytics clusters - https://phabricator.wikimedia.org/T295072
[14:08:40] <logmsgbot>	 !log aqu@deploy1002 Finished deploy [airflow-dags/analytics_test@95d0f86]: T295072 Spark 3 from Airflow venv pyspark [airflow-dags/analytics_test@95d0f86] (duration: 00m 08s)
[14:08:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T298555)', diff saved to https://phabricator.wikimedia.org/P28336 and previous config saved to /var/cache/conftool/dbconfig/20220523-140954-ladsgroup.json
[14:09:55] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[14:09:57] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[14:09:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:58] <stashbot>	 T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[14:10:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T298555)', diff saved to https://phabricator.wikimedia.org/P28337 and previous config saved to /var/cache/conftool/dbconfig/20220523-141001-ladsgroup.json
[14:10:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:10:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:10:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:12:33] <logmsgbot>	 !log aqu@deploy1002 Started deploy [airflow-dags/analytics@95d0f86]: T295072 spark 3 from airflow venv pyspark [airflow-dags/analytics@95d0f86]
[14:12:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:12:42] <logmsgbot>	 !log aqu@deploy1002 Finished deploy [airflow-dags/analytics@95d0f86]: T295072 spark 3 from airflow venv pyspark [airflow-dags/analytics@95d0f86] (duration: 00m 08s)
[14:12:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:13:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (4) rsyslog on kubernetes1014:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[14:14:16] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1135.eqiad.wmnet with OS bullseye
[14:14:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:17:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1177 (re)pooling @ 10%: After recloning db1172', diff saved to https://phabricator.wikimedia.org/P28338 and previous config saved to /var/cache/conftool/dbconfig/20220523-141705-root.json
[14:17:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:18:56] <moritzm>	 !log failover ganeti master in eqiad to ganeti1027
[14:19:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:20:22] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host maps2010.codfw.wmnet
[14:20:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:20:34] <inflatador>	 !log Add AAAA records to relforge1003 and 1004 T271143
[14:20:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:20:39] <stashbot>	 T271143: Some elastic hosts do not have IPv6 DNS records - https://phabricator.wikimedia.org/T271143
[14:22:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T303171)', diff saved to https://phabricator.wikimedia.org/P28339 and previous config saved to /var/cache/conftool/dbconfig/20220523-142202-ladsgroup.json
[14:22:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:22:09] <stashbot>	 T303171: Upgrade s1 to Bullseye - https://phabricator.wikimedia.org/T303171
[14:23:00] <icinga-wm>	 PROBLEM - ganeti-wconfd running on ganeti1024 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[14:26:10] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps2010.codfw.wmnet
[14:26:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:26:40] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.dns.netbox
[14:26:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:30:39] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host maps2009.codfw.wmnet
[14:30:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:32:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1177 (re)pooling @ 25%: After recloning db1172', diff saved to https://phabricator.wikimedia.org/P28340 and previous config saved to /var/cache/conftool/dbconfig/20220523-143209-root.json
[14:32:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:34:39] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:34:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:35:14] <icinga-wm>	 RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:36:42] <logmsgbot>	 !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host maps2009.codfw.wmnet
[14:36:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:37:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P28341 and previous config saved to /var/cache/conftool/dbconfig/20220523-143707-ladsgroup.json
[14:37:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:39:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1024.eqiad.wmnet
[14:39:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:42:34] <icinga-wm>	 PROBLEM - Host kubetcd1004 is DOWN: PING CRITICAL - Packet loss = 100%
[14:45:50] <icinga-wm>	 RECOVERY - Host kubetcd1004 is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms
[14:46:43] <_joe_>	 wat
[14:46:55] <_joe_>	 ahh ganeti down
[14:46:56] <_joe_>	 ok
[14:47:13] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1177 (re)pooling @ 50%: After recloning db1172', diff saved to https://phabricator.wikimedia.org/P28342 and previous config saved to /var/cache/conftool/dbconfig/20220523-144713-root.json
[14:47:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:49:44] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ganeti1024.eqiad.wmnet
[14:49:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:51:22] <icinga-wm>	 PROBLEM - ganeti-wconfd running on ganeti1024 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[14:52:11] <moritzm>	 ^ that'll recover soon, monitoring artefact of the master failover
[14:52:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P28343 and previous config saved to /var/cache/conftool/dbconfig/20220523-145212-ladsgroup.json
[14:52:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:52:58] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (5) rsyslog on kubernetes1014:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[15:01:51] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.ganeti.makevm for new host netbox1002.eqiad.wmnet
[15:01:52] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.dns.netbox
[15:01:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:01:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:02:10] <Emperor>	 !log rebooting ms-be2069 to look at disk config
[15:02:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:02:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1177 (re)pooling @ 75%: After recloning db1172', diff saved to https://phabricator.wikimedia.org/P28345 and previous config saved to /var/cache/conftool/dbconfig/20220523-150217-root.json
[15:02:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:03:40] <icinga-wm>	 PROBLEM - Host ms-be2069 is DOWN: PING CRITICAL - Packet loss = 100%
[15:06:17] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:06:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:07:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T303171)', diff saved to https://phabricator.wikimedia.org/P28346 and previous config saved to /var/cache/conftool/dbconfig/20220523-150717-ladsgroup.json
[15:07:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:07:21] <stashbot>	 T303171: Upgrade s1 to Bullseye - https://phabricator.wikimedia.org/T303171
[15:07:43] <jbond>	 Emperor: fyi if you use `sudo cookbook sre.hosts.reboot-single $hostname` it will take care of downtiming the host in icinga
[15:08:20] <Emperor>	 oh, duh, yes, sorry.
[15:08:26] <jbond>	 no problem :)
[15:11:32] <icinga-wm>	 RECOVERY - Host ms-be2069 is UP: PING OK - Packet loss = 0%, RTA = 33.14 ms
[15:12:01] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1184.eqiad.wmnet with reason: Maintenance
[15:12:03] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1184.eqiad.wmnet with reason: Maintenance
[15:12:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:12:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T303171)', diff saved to https://phabricator.wikimedia.org/P28348 and previous config saved to /var/cache/conftool/dbconfig/20220523-151207-ladsgroup.json
[15:12:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:12:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:12:36] <urbanecm>	 jouncebot: nowandnext
[15:12:36] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 17 minute(s)
[15:12:36] <jouncebot>	 In 0 hour(s) and 17 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220523T1530)
[15:13:06] <papaul>	 !log poweroff cp2038 for maintenance 
[15:13:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:14:05] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netbox1002.eqiad.wmnet
[15:14:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:17:18] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db1184.eqiad.wmnet with OS bullseye
[15:17:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1177 (re)pooling @ 100%: After recloning db1172', diff saved to https://phabricator.wikimedia.org/P28349 and previous config saved to /var/cache/conftool/dbconfig/20220523-151721-root.json
[15:17:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:17:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:17:33] <urbanecm>	 jan_drewniak: please wait with your portals deployment until further notice, there is an urgent security issue me and taavi would like to fix before that.
[15:18:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T298555)', diff saved to https://phabricator.wikimedia.org/P28350 and previous config saved to /var/cache/conftool/dbconfig/20220523-151826-ladsgroup.json
[15:18:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:18:32] <stashbot>	 T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[15:24:45] <icinga-wm>	 PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:25:37] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[15:26:05] <taavi>	 !log deploy patch for T309028
[15:26:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:27:53] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2009.codfw.wmnet
[15:27:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:28:43] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1184.eqiad.wmnet with reason: host reimage
[15:28:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:30:05] <jouncebot>	 jan_drewniak: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220523T1530).
[15:30:45] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[15:30:46] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[15:30:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:30:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:32:02] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1184.eqiad.wmnet with reason: host reimage
[15:32:03] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2009.codfw.wmnet
[15:32:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:32:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:32:31] <taavi>	 jan_drewniak: we're done, you can proceed as usual
[15:32:56] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1020.eqiad.wmnet with OS bullseye
[15:33:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:33:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P28351 and previous config saved to /var/cache/conftool/dbconfig/20220523-153331-ladsgroup.json
[15:33:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:34:06] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS buster
[15:34:06] <logmsgbot>	 !log volans@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host sretest1001.eqiad.wmnet with OS buster
[15:34:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:34:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:34:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[15:34:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:35:15] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS buster
[15:35:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:35:45] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[15:35:46] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[15:35:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:35:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:36:42] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[15:36:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:37:27] <mbsantos>	 hey, we need to do a deployment for the Wikifeeds service out of the deployment window for the apps fundraising campaign, are there any questions or concerns about doing that nowish?
[15:38:47] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.remove-downtime for cp2038.codfw.wmnet
[15:38:48] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp2038.codfw.wmnet
[15:38:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:38:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:39:13] <vgutierrez>	 !log pool cp2038 - T308459
[15:39:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:39:18] <stashbot>	 T308459: codfw: cp2038 Correctable memory error on DIMM A3 - https://phabricator.wikimedia.org/T308459
[15:42:36] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2010.codfw.wmnet
[15:42:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:42:42] <logmsgbot>	 !log mbsantos@deploy1002 helmfile [staging] START helmfile.d/services/wikifeeds: apply
[15:42:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:43:04] <logmsgbot>	 !log mbsantos@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply
[15:43:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:43:42] <logmsgbot>	 !log mbsantos@deploy1002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply
[15:43:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:43:45] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti5001.eqsin.wmnet with OS bullseye
[15:43:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:44:28] <logmsgbot>	 !log mbsantos@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply
[15:44:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:44:57] <logmsgbot>	 !log mbsantos@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply
[15:45:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:45:42] <logmsgbot>	 !log mbsantos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply
[15:45:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:46:02] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage
[15:46:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:46:40] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2010.codfw.wmnet
[15:46:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:46:56] <wikibugs>	 (03PS3) 10Jbond: netbox: add new netbox serveres to netbox::fronend [puppet] - 10https://gerrit.wikimedia.org/r/797329
[15:47:40] <wikibugs>	 10SRE, 10Wikibugs: wikibugs has stopped showing phab/gerrit comments on IRC as of 2022-05-22Z17:00 - https://phabricator.wikimedia.org/T308995 (10valhallasw) @Marostegui which command(s) did you run, exactly?   ` tools.wikibugs@tools-sgebastion-10:~$ kubectl get pods NAME                             READY   ST...
[15:48:10] <wikibugs>	 10SRE-OnFire, 10SRE Observability (FY2021/2022-Q4): Set up POC Dispatch environment and evaluate its viability - https://phabricator.wikimedia.org/T309033 (10herron) After spending some time deploying and testing dispatch in a POC lab environment (dispatch[12].sre-sandbox.eqiad1.wikimedia.cloud), here are my r...
[15:48:14] <wikibugs>	 10SRE, 10Wikibugs: wikibugs has stopped showing phab/gerrit comments on IRC as of 2022-05-22Z17:00 - https://phabricator.wikimedia.org/T308995 (10DannyS712) Wikibugs just joined `#wikimedia-operations`
[15:48:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P28352 and previous config saved to /var/cache/conftool/dbconfig/20220523-154836-ladsgroup.json
[15:48:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:48:50] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage
[15:48:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:49:30] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1184.eqiad.wmnet with OS bullseye
[15:49:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:50:11] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] netbox: add new netbox serveres to netbox::fronend [puppet] - 10https://gerrit.wikimedia.org/r/797329 (owner: 10Jbond)
[15:50:15] <wikibugs>	 10SRE, 10Wikibugs: wikibugs has stopped showing phab/gerrit comments on IRC as of 2022-05-22Z17:00 - https://phabricator.wikimedia.org/T308995 (10Marostegui) Thanks @valhallasw! I followed what wikitech mentions. Maybe we should write it clearer so we don't have to bother you again :-)
[15:51:13] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[15:53:18] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2011.codfw.wmnet
[15:53:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:55:25] <icinga-wm>	 PROBLEM - Host kubestagetcd2002 is DOWN: PING CRITICAL - Packet loss = 100%
[15:57:31] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Upgrade Fastnetmon to 1.2.1 - https://phabricator.wikimedia.org/T271228 (10MoritzMuehlenhoff) @ayounsi I've built a backport of fastnetmon 1.2.1 for bullseye-wikimedia. It's not yet uploaded to apt.wikimedia.org, let's sync up for some smoke testing when you're...
[15:57:36] <moritzm>	 ^ kubestagetcd2002 is the ganeti reboot
[15:57:59] <wikibugs>	 (03PS2) 10Muehlenhoff: thanos: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/793775 (https://phabricator.wikimedia.org/T308013)
[15:58:51] <wikibugs>	 (03PS4) 10Jbond: netbox: add new netbox serveres to netbox::fronend [puppet] - 10https://gerrit.wikimedia.org/r/797329
[15:59:14] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2011.codfw.wmnet
[15:59:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:00:08] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] thanos: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/793775 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[16:00:49] <icinga-wm>	 RECOVERY - Host kubestagetcd2002 is UP: PING OK - Packet loss = 0%, RTA = 33.46 ms
[16:01:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T303171)', diff saved to https://phabricator.wikimedia.org/P28353 and previous config saved to /var/cache/conftool/dbconfig/20220523-160105-ladsgroup.json
[16:01:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:01:12] <stashbot>	 T303171: Upgrade s1 to Bullseye - https://phabricator.wikimedia.org/T303171
[16:01:21] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1001.eqiad.wmnet with OS buster
[16:01:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:01:38] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[16:01:40] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[16:01:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:01:45] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host aqs1020.eqiad.wmnet with OS bullseye
[16:01:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:01:49] <wikibugs>	 SRE, ops-eqiad, Cassandra, DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host aqs1020.eqiad.wmnet with OS bullseye executed with errors: - aqs1020...
[16:01:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:03:23] <wikibugs>	 (PS2) Muehlenhoff: klaxon: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/793773 (https://phabricator.wikimedia.org/T308013)
[16:03:40] <wikibugs>	 (PS2) Muehlenhoff: helm/helmfile: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/793774 (https://phabricator.wikimedia.org/T308013)
[16:03:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T298555)', diff saved to https://phabricator.wikimedia.org/P28354 and previous config saved to /var/cache/conftool/dbconfig/20220523-160341-ladsgroup.json
[16:03:43] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[16:03:45] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[16:03:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:03:47] <stashbot>	 T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[16:03:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:03:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:05:22] <wikibugs>	 (CR) Jbond: [C: +2] netbox: add new netbox serveres to netbox::fronend [puppet] - https://gerrit.wikimedia.org/r/797329 (owner: Jbond)
[16:06:57] <wikibugs>	 (CR) Muehlenhoff: [C: +2] klaxon: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/793773 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[16:06:58] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (3) rsyslog on kubestage1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[16:10:56] <wikibugs>	 (PS1) Jbond: P:netbox::automation: Drop Acme dependency [puppet] - https://gerrit.wikimedia.org/r/797338 (https://phabricator.wikimedia.org/T296452)
[16:11:27] <wikibugs>	 (PS1) Zabe: toolforge: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/797339 (https://phabricator.wikimedia.org/T308013)
[16:13:07] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[16:13:51] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti5001.eqsin.wmnet with reason: host reimage
[16:13:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:15:03] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[16:15:25] <wikibugs>	 (PS1) Zabe: tomcat: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/797341 (https://phabricator.wikimedia.org/T308013)
[16:16:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P28355 and previous config saved to /var/cache/conftool/dbconfig/20220523-161610-ladsgroup.json
[16:16:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:16:38] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2012.codfw.wmnet
[16:16:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:17:00] <wikibugs>	 (PS3) Zabe: postgresql: remove absented backup crons [puppet] - https://gerrit.wikimedia.org/r/777434 (https://phabricator.wikimedia.org/T273673)
[16:17:07] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti5001.eqsin.wmnet with reason: host reimage
[16:17:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:18:33] <wikibugs>	 (CR) STran: [C: +1] Deploy IPInfo to all wikis by default [mediawiki-config] - https://gerrit.wikimedia.org/r/793841 (https://phabricator.wikimedia.org/T260597) (owner: Tchanders)
[16:19:01] <wikibugs>	 (CR) STran: [C: +1] Add SimilarEditors extension – I: Add to i18n [mediawiki-config] - https://gerrit.wikimedia.org/r/789250 (https://phabricator.wikimedia.org/T306909) (owner: Tchanders)
[16:19:01] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[16:19:45] <wikibugs>	 (CR) Jbond: [C: +2] P:netbox::automation: Drop Acme dependency [puppet] - https://gerrit.wikimedia.org/r/797338 (https://phabricator.wikimedia.org/T296452) (owner: Jbond)
[16:20:54] <DannyS712>	 does anyone know where I could find documentation for $wgDontNotUnDisenableInstantCommons ?
[16:21:14] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2012.codfw.wmnet
[16:21:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:22:08] <taavi>	 DannyS712: I don't think that's a thing - https://codesearch.wmcloud.org/search/?q=DontNotUnDisenableInstantCommons&i=nope&files=&excludeFiles=&repos=
[16:22:42] <DannyS712>	 https://bash.toolforge.org/quip/AU7VVSDt6snAnmqnK_wG suggests it is :)
[16:23:40] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[16:23:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:25:11] <icinga-wm>	 RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:28:19] <inflatador>	 !log adding AAAA records for cloudelastic100[1-6] T271143
[16:28:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:28:23] <stashbot>	 T271143: Some elastic hosts do not have IPv6 DNS records - https://phabricator.wikimedia.org/T271143
[16:29:10] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[16:29:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:29:29] <wikibugs>	 (CR) Ottomata: [C: +1] "<3" [puppet] - https://gerrit.wikimedia.org/r/797127 (https://phabricator.wikimedia.org/T296982) (owner: Elukey)
[16:31:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P28356 and previous config saved to /var/cache/conftool/dbconfig/20220523-163116-ladsgroup.json
[16:31:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:32:28] <inflatador>	 ^^ Ignore my earlier log msg, cloudelastic already has AAAA records
[16:38:54] <wikibugs>	 (PS1) Volans: sre.dns.netbox: limit matching hosts [cookbooks] - https://gerrit.wikimedia.org/r/797343 (https://phabricator.wikimedia.org/T296452)
[16:39:09] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti5001.eqsin.wmnet with OS bullseye
[16:39:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:39:15] <wikibugs>	 SRE, Infrastructure-Foundations: Upgrade ganeti/eqsin to Bullseye - https://phabricator.wikimedia.org/T308211 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host ganeti5001.eqsin.wmnet with OS bullseye completed: - ganeti5001 (**PASS**)   - Downtimed on Icinga/Ale...
[16:39:39] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm thanks" [cookbooks] - https://gerrit.wikimedia.org/r/797343 (https://phabricator.wikimedia.org/T296452) (owner: Volans)
[16:40:51] <wikibugs>	 SRE, Infrastructure-Foundations: Upgrade ganeti/eqsin to Bullseye - https://phabricator.wikimedia.org/T308211 (RobH) a:RobH→MoritzMuehlenhoff @MoritzMuehlenhoff,  per our earlier IRC discussion, ganeti5001 has had all the firmware updated and reimaged successfully.  All yours!
[16:41:31] <wikibugs>	 SRE, Infrastructure-Foundations: Upgrade ganeti/eqsin to Bullseye - https://phabricator.wikimedia.org/T308211 (RobH)
[16:44:16] <inflatador>	 !log add AAAA records to elastic202[5-9] T271143
[16:44:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:44:22] <stashbot>	 T271143: Some elastic hosts do not have IPv6 DNS records - https://phabricator.wikimedia.org/T271143
[16:46:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T303171)', diff saved to https://phabricator.wikimedia.org/P28357 and previous config saved to /var/cache/conftool/dbconfig/20220523-164621-ladsgroup.json
[16:46:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:46:27] <stashbot>	 T303171: Upgrade s1 to Bullseye - https://phabricator.wikimedia.org/T303171
[16:47:58] <wikibugs>	 (PS1) Ladsgroup: db1106: Disable notification [puppet] - https://gerrit.wikimedia.org/r/797344 (https://phabricator.wikimedia.org/T303171)
[16:48:30] <wikibugs>	 (PS1) Ladsgroup: Revert "db1184: Disable notification" [puppet] - https://gerrit.wikimedia.org/r/797217
[16:48:49] <wikibugs>	 (PS2) Ladsgroup: Revert "db1184: Disable notification" [puppet] - https://gerrit.wikimedia.org/r/797217
[16:49:30] <wikibugs>	 (CR) Ladsgroup: [C: +2] db1106: Disable notification [puppet] - https://gerrit.wikimedia.org/r/797344 (https://phabricator.wikimedia.org/T303171) (owner: Ladsgroup)
[16:49:52] <wikibugs>	 (PS3) Ladsgroup: Revert "db1184: Disable notification" [puppet] - https://gerrit.wikimedia.org/r/797217
[16:49:56] <wikibugs>	 (CR) Ladsgroup: [V: +2 C: +2] Revert "db1184: Disable notification" [puppet] - https://gerrit.wikimedia.org/r/797217 (owner: Ladsgroup)
[16:50:34] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1106.eqiad.wmnet with reason: Maintenance
[16:50:36] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1106.eqiad.wmnet with reason: Maintenance
[16:50:37] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[16:50:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:50:41] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[16:50:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:50:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T303171)', diff saved to https://phabricator.wikimedia.org/P28358 and previous config saved to /var/cache/conftool/dbconfig/20220523-165045-ladsgroup.json
[16:50:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:50:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:50:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:51:36] <wikibugs>	 (CR) Volans: [V: +2 C: +2] "zuul stuck with the queue, trivial urgent change to unblock work." [cookbooks] - https://gerrit.wikimedia.org/r/797343 (https://phabricator.wikimedia.org/T296452) (owner: Volans)
[16:52:52] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[16:52:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:54:46] <wikibugs>	 SRE, ops-codfw, DC-Ops, fundraising-tech-ops: Q3:(Need By: TBD) rack/setup/install frbackup2002 - https://phabricator.wikimedia.org/T306842 (Jgreen)
[16:59:23] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:59:24] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db1106.eqiad.wmnet with OS bullseye
[16:59:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:59:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:00:04] <jouncebot>	 ryankemper: My dear minions, it's time we take the moon! Just kidding. Time for Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220523T1700).
[17:00:51] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.dns.netbox
[17:00:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:03:47] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/797341 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[17:04:06] <wikibugs>	 (Restored) Winston Sung: Rearrange zh-related fallbacks and zh/zh-* translations, aliases in mediawiki/core [core] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/788416 (owner: Winston Sung)
[17:04:13] <wikibugs>	 (CR) Jbond: [C: +2] tomcat: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/797341 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[17:04:14] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:04:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:04:33] <wikibugs>	 (PS3) Winston Sung: [Abandoned] [core] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/788416
[17:04:39] <wikibugs>	 (Abandoned) Winston Sung: [Abandoned] [core] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/788416 (owner: Winston Sung)
[17:04:58] <wikibugs>	 (CR) Jbond: [C: +2] postgresql: remove absented backup crons [puppet] - https://gerrit.wikimedia.org/r/777434 (https://phabricator.wikimedia.org/T273673) (owner: Zabe)
[17:06:18] <wikibugs>	 (Restored) Winston Sung: Temporarily disable yue language fallback tests [extensions/Wikibase] (wmf/1.39.0-wmf.9) - https://gerrit.wikimedia.org/r/788608 (owner: Winston Sung)
[17:06:23] <wikibugs>	 (PS3) Winston Sung: [Abandoned] [extensions/Wikibase] (wmf/1.39.0-wmf.9) - https://gerrit.wikimedia.org/r/788608
[17:06:28] <wikibugs>	 (Abandoned) Winston Sung: [Abandoned] [extensions/Wikibase] (wmf/1.39.0-wmf.9) - https://gerrit.wikimedia.org/r/788608 (owner: Winston Sung)
[17:06:35] <wikibugs>	 (Restored) Winston Sung: Revert "Temporarily disable yue language fallback tests" [extensions/Wikibase] (wmf/1.39.0-wmf.9) - https://gerrit.wikimedia.org/r/788610 (owner: Winston Sung)
[17:06:47] <wikibugs>	 (PS4) Winston Sung: [Abandoned] [extensions/Wikibase] (wmf/1.39.0-wmf.9) - https://gerrit.wikimedia.org/r/788610
[17:06:51] <wikibugs>	 (Abandoned) Winston Sung: [Abandoned] [extensions/Wikibase] (wmf/1.39.0-wmf.9) - https://gerrit.wikimedia.org/r/788610 (owner: Winston Sung)
[17:07:41] <wikibugs>	 (PS2) Jbond: P:ssh::client: Add GSSAPIDelegateCredentials support to ssh::client [puppet] - https://gerrit.wikimedia.org/r/791567
[17:08:18] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1106.eqiad.wmnet with reason: host reimage
[17:08:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:09:15] <wikibugs>	 (Restored) Winston Sung: Temporarily disable yue language fallback tests [extensions/Wikibase] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/788417 (owner: Winston Sung)
[17:09:26] <wikibugs>	 (PS3) Winston Sung: [Abandoned] [extensions/Wikibase] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/788417
[17:09:36] <wikibugs>	 (Abandoned) Winston Sung: [Abandoned] [extensions/Wikibase] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/788417 (owner: Winston Sung)
[17:09:45] <wikibugs>	 (Restored) Winston Sung: Revert "Temporarily disable yue language fallback tests" [extensions/Wikibase] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/788418 (owner: Winston Sung)
[17:09:48] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service,rq-netbox.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:09:58] <wikibugs>	 (PS4) Winston Sung: [Abandoned] [extensions/Wikibase] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/788418
[17:10:02] <wikibugs>	 (Abandoned) Winston Sung: [Abandoned] [extensions/Wikibase] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/788418 (owner: Winston Sung)
[17:10:27] <wikibugs>	 (Restored) Winston Sung: Rearrange zh-related fallbacks and zh/zh-* translations, aliases in mediawiki/core [core] (wmf/1.39.0-wmf.9) - https://gerrit.wikimedia.org/r/788606 (owner: Winston Sung)
[17:10:48] <wikibugs>	 (PS5) Winston Sung: [Abandoned] [core] (wmf/1.39.0-wmf.9) - https://gerrit.wikimedia.org/r/788606
[17:10:53] <wikibugs>	 (Abandoned) Winston Sung: [Abandoned] [core] (wmf/1.39.0-wmf.9) - https://gerrit.wikimedia.org/r/788606 (owner: Winston Sung)
[17:11:03] <wikibugs>	 (Restored) Winston Sung: Add tests closer to real use cases for Special:MyLanguage [core] (wmf/1.39.0-wmf.9) - https://gerrit.wikimedia.org/r/788611 (https://phabricator.wikimedia.org/T278639) (owner: Winston Sung)
[17:11:09] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1106.eqiad.wmnet with reason: host reimage
[17:11:12] <wikibugs>	 (PS3) Winston Sung: [Abandoned] [core] (wmf/1.39.0-wmf.9) - https://gerrit.wikimedia.org/r/788611
[17:11:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:11:17] <wikibugs>	 (Abandoned) Winston Sung: [Abandoned] [core] (wmf/1.39.0-wmf.9) - https://gerrit.wikimedia.org/r/788611 (owner: Winston Sung)
[17:13:24] <wikibugs>	 SRE, Infrastructure-Foundations, Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (valhallasw)
[17:14:18] <wikibugs>	 (PS1) Papaul: ADd DNS for new frbackuup node [dns] - https://gerrit.wikimedia.org/r/797347
[17:19:03] <wikibugs>	 (CR) Dzahn: [C: +2] gitlab: reduce backup_keep_time to 2d [puppet] - https://gerrit.wikimedia.org/r/797278 (https://phabricator.wikimedia.org/T274463) (owner: Jelto)
[17:19:56] <wikibugs>	 (CR) Papaul: [C: +2] ADd DNS for new frbackuup node [dns] - https://gerrit.wikimedia.org/r/797347 (owner: Papaul)
[17:23:59] <wikibugs>	 SRE, ops-codfw, DC-Ops, fundraising-tech-ops, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install frbackup2002 - https://phabricator.wikimedia.org/T306842 (Papaul)
[17:26:31] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1106.eqiad.wmnet with OS bullseye
[17:26:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:30:18] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[17:30:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:33:24] <wikibugs>	 (PS5) AOkoth: vrts: rename module files and classes [puppet] - https://gerrit.wikimedia.org/r/776237 (https://phabricator.wikimedia.org/T293942)
[17:34:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T303171)', diff saved to https://phabricator.wikimedia.org/P28359 and previous config saved to /var/cache/conftool/dbconfig/20220523-173439-ladsgroup.json
[17:34:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:34:45] <stashbot>	 T303171: Upgrade s1 to Bullseye - https://phabricator.wikimedia.org/T303171
[17:36:54] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:36:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:44:03] <wikibugs>	 (PS1) Samtar: changeprop: Remove WP:ANI from page blacklist [deployment-charts] - https://gerrit.wikimedia.org/r/797354 (https://phabricator.wikimedia.org/T274359)
[17:47:29] <wikibugs>	 (PS1) Zabe: toil: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/797355 (https://phabricator.wikimedia.org/T308013)
[17:49:07] <wikibugs>	 SRE, ops-codfw, DC-Ops, fundraising-tech-ops: Q3:(Need By: TBD) rack/setup/install frbackup2002 - https://phabricator.wikimedia.org/T306842 (Papaul) switch configuration  ` papaul@fasw-c-codfw# run show interfaces ge-[0-1]/0/11 descriptions Interface       Admin Link Description ge-0/0/11       u...
[17:49:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P28360 and previous config saved to /var/cache/conftool/dbconfig/20220523-174944-ladsgroup.json
[17:49:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:50:19] <wikibugs>	 (PS1) Zabe: tmpreaper: Add SPDX header [puppet] - https://gerrit.wikimedia.org/r/797362 (https://phabricator.wikimedia.org/T308013)
[17:53:39] <wikibugs>	 (CR) Samtar: "Again, my uninformed test plan is T274359#7751644, but in this case *just* testing https://en.wikipedia.org/wiki/Wikipedia:Administrators%" [deployment-charts] - https://gerrit.wikimedia.org/r/797354 (https://phabricator.wikimedia.org/T274359) (owner: Samtar)
[17:53:54] <wikibugs>	 SRE, Infrastructure-Foundations, Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (Jdlrobson)
[17:54:56] <wikibugs>	 SRE, ops-codfw, DC-Ops, fundraising-tech-ops: Q3:(Need By: TBD) rack/setup/install frbackup2002 - https://phabricator.wikimedia.org/T306842 (Papaul)
[17:56:03] <wikibugs>	 SRE, ops-codfw, DC-Ops, fundraising-tech-ops: Q3:(Need By: TBD) rack/setup/install frbackup2002 - https://phabricator.wikimedia.org/T306842 (Papaul) a:Papaul→Jgreen @Jgreen all yours
[17:56:12] <wikibugs>	 (PS1) Zabe: threedtopng: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/797366 (https://phabricator.wikimedia.org/T308013)
[18:00:27] <wikibugs>	 (CR) STran: [C: +1] Add comment to consult Legal before updating IPInfo access [mediawiki-config] - https://gerrit.wikimedia.org/r/793849 (https://phabricator.wikimedia.org/T308876) (owner: Tchanders)
[18:01:40] <wikibugs>	 (CR) STran: [C: +1] Remove outdated comment about IPInfo from CommonSettings-labs.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793848 (https://phabricator.wikimedia.org/T308876) (owner: Tchanders)
[18:03:45] <wikibugs>	 (CR) Dzahn: "the team owning the license for these is the Anti Harassment team fwiw" [mediawiki-config] - https://gerrit.wikimedia.org/r/793849 (https://phabricator.wikimedia.org/T308876) (owner: Tchanders)
[18:04:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P28361 and previous config saved to /var/cache/conftool/dbconfig/20220523-180449-ladsgroup.json
[18:04:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:05:56] <logmsgbot>	 !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@d1f4367]: T307983: weekly import of image suggestions
[18:06:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:06:01] <stashbot>	 T307983: Write search index data for image suggestions into a hive table rather than local hdfs files - https://phabricator.wikimedia.org/T307983
[18:07:13] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[18:07:14] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[18:07:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:07:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:08:17] <logmsgbot>	 !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@d1f4367]: T307983: weekly import of image suggestions (duration: 02m 21s)
[18:08:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:11:21] <wikibugs>	 (PS1) Ladsgroup: Revert "db1106: Disable notification" [puppet] - https://gerrit.wikimedia.org/r/797218
[18:11:26] <wikibugs>	 (PS2) Ladsgroup: Revert "db1106: Disable notification" [puppet] - https://gerrit.wikimedia.org/r/797218
[18:11:45] <wikibugs>	 (CR) Ladsgroup: [V: +2 C: +2] Revert "db1106: Disable notification" [puppet] - https://gerrit.wikimedia.org/r/797218 (owner: Ladsgroup)
[18:19:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T303171)', diff saved to https://phabricator.wikimedia.org/P28364 and previous config saved to /var/cache/conftool/dbconfig/20220523-181954-ladsgroup.json
[18:19:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:20:00] <stashbot>	 T303171: Upgrade s1 to Bullseye - https://phabricator.wikimedia.org/T303171
[18:25:16] <ryankemper>	 !log T308647 Bringing `elastic2054` back into service: `ryankemper@elastic2054:~$ sudo pool` (it's not currently banned from cluster so nothing to do there)
[18:25:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:25:22] <stashbot>	 T308647: elastic2054 is having H/W issues - https://phabricator.wikimedia.org/T308647
[18:25:59] <wikibugs>	 SRE, ops-codfw, CirrusSearch, DC-Ops, Discovery-Search: elastic2054 is having H/W issues - https://phabricator.wikimedia.org/T308647 (RKemper) Thanks for looking into this, all. I've brought the host back into service and will reopen the ticket if problems re-surface, but for now things look...
[18:31:27] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to mwmaint1002.eqiad.wmnet for sgimeno - https://phabricator.wikimedia.org/T309045 (Sgs)
[18:32:32] <logmsgbot>	 !log mforns@deploy1002 Started deploy [airflow-dags/analytics@2d8e8d1]: (no justification provided)
[18:32:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:32:40] <logmsgbot>	 !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@2d8e8d1]: (no justification provided) (duration: 00m 07s)
[18:32:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:37:44] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to mwmaint1002.eqiad.wmnet for sgimeno - https://phabricator.wikimedia.org/T309045 (MShilova_WMF) I confirm that @sgs needs access to a production server and it is currently blocking {https://phabricator.wikimedia.org/T307454}. More context for that task can be...
[18:39:14] <TheresNoTime>	 Hey all, I have a feeling https://tools-prometheus.wmflabs.org/tools/api/v1/query_range isn't meant to be returning a 503 (was trying to figure out why https://grafana-labs.wikimedia.org/d/toolforge-k8s-namespace-resources/kubernetes-namespace-resources?orgId=1&refresh=5m&var-namespace=tool-refill wasn't loading)
[18:39:39] <TheresNoTime>	 of course it starts working now
[18:42:30] <wikibugs>	 SRE-tools, Discovery, Discovery-Search, Infrastructure-Foundations, IPv6: Some elastic hosts do not have IPv6 DNS records - https://phabricator.wikimedia.org/T271143 (bking) AAAA records successfully added for elastic202[5-9]: ` for n in $(cat codfw.hosts); do quad=$(dig aaaa +short ${n});pri...
[18:48:38] <wikibugs>	 SRE-tools, Discovery, Infrastructure-Foundations, Discovery-Search (Current work), IPv6: Some elastic hosts do not have IPv6 DNS records - https://phabricator.wikimedia.org/T271143 (Gehel)
[18:53:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (5) rsyslog on kubernetes1014:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[18:57:09] <wikibugs>	 (CR) Tchanders: Add comment to consult Legal before updating IPInfo access (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/793849 (https://phabricator.wikimedia.org/T308876) (owner: Tchanders)
[19:02:41] <icinga-wm>	 PROBLEM - SSH on wtp1046.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:05:11] <icinga-wm>	 PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:05:37] <urbanecm>	 TheresNoTime: fyi, issues like the one you raised are best noted in -cloud :)
[19:09:39] <wikibugs>	 (PS1) Majavah: nrpe: manage sudo rules via nrpe::check [puppet] - https://gerrit.wikimedia.org/r/797422
[19:11:48] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 3 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35505/console" [puppet] - https://gerrit.wikimedia.org/r/797422 (owner: Majavah)
[19:12:18] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:12:30] <wikibugs>	 (PS1) Kosta Harlan: linkrecommendation: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/797423 (https://phabricator.wikimedia.org/T303515)
[19:12:44] <wikibugs>	 (PS1) Clare Ming: Deploy TOC A/B test to frwiki, ptwiki at 50% [mediawiki-config] - https://gerrit.wikimedia.org/r/797424 (https://phabricator.wikimedia.org/T306607)
[19:12:52] <wikibugs>	 (CR) Kosta Harlan: [C: +2] linkrecommendation: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/797423 (https://phabricator.wikimedia.org/T303515) (owner: Kosta Harlan)
[19:14:36] <wikibugs>	 (CR) Dzahn: "forgive my ignorance but wouldn't it be much easier to have the same base class or " include ::nrpe" as every machine in prod instead of i" [puppet] - https://gerrit.wikimedia.org/r/797422 (owner: Majavah)
[19:15:59] <wikibugs>	 (PS2) Clare Ming: Deploy TOC A/B test to frwiki, ptwiki at 50% [mediawiki-config] - https://gerrit.wikimedia.org/r/797424 (https://phabricator.wikimedia.org/T306607)
[19:16:54] <wikibugs>	 (CR) Eevans: [C: +1] aqs: allow Kubernetes nodes access to cassandra [puppet] - https://gerrit.wikimedia.org/r/793839 (https://phabricator.wikimedia.org/T304891) (owner: Hnowlan)
[19:17:09] <logmsgbot>	 !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@5a4803a]: T307983: zero-pad dates within @dailysnapshot
[19:17:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:17:16] <stashbot>	 T307983: Write search index data for image suggestions into a hive table rather than local hdfs files - https://phabricator.wikimedia.org/T307983
[19:17:18] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:18:03] <wikibugs>	 (Merged) jenkins-bot: linkrecommendation: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/797423 (https://phabricator.wikimedia.org/T303515) (owner: Kosta Harlan)
[19:19:30] <logmsgbot>	 !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@5a4803a]: T307983: zero-pad dates within @dailysnapshot (duration: 02m 20s)
[19:19:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:19:51] <logmsgbot>	 !log kharlan@deploy1002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply
[19:19:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:21:17] <logmsgbot>	 !log kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply
[19:21:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:23:36] <logmsgbot>	 !log kharlan@deploy1002 helmfile [eqiad] START helmfile.d/services/linkrecommendation: apply
[19:23:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:26:03] <logmsgbot>	 !log kharlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: apply
[19:26:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:26:58] <logmsgbot>	 !log kharlan@deploy1002 helmfile [codfw] START helmfile.d/services/linkrecommendation: apply
[19:27:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:27:57] <icinga-wm>	 PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:29:08] <logmsgbot>	 !log kharlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/linkrecommendation: apply
[19:29:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:32:36] <wikibugs>	 (CR) Majavah: [V: +1] nrpe: manage sudo rules via nrpe::check (1 comment) [puppet] - https://gerrit.wikimedia.org/r/797422 (owner: Majavah)
[19:40:50] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db2105.codfw.wmnet with reason: Maintenance
[19:40:51] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2105.codfw.wmnet with reason: Maintenance
[19:40:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:40:52] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 20:00:00 on 6 hosts with reason: Maintenance
[19:40:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:40:57] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on 6 hosts with reason: Maintenance
[19:40:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:41:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:42:13] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to mwmaint1002.eqiad.wmnet for sgimeno - https://phabricator.wikimedia.org/T309045 (SCherukuwada) Manager is OOO. Skip-level Manager here, approved (if needed).
[19:46:52] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[19:46:54] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[19:46:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:46:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T298560)', diff saved to https://phabricator.wikimedia.org/P28366 and previous config saved to /var/cache/conftool/dbconfig/20220523-194659-ladsgroup.json
[19:47:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:47:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:47:06] <stashbot>	 T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, and cjming: May I have your attention please! UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220523T2000)
[20:00:06] <jouncebot>	 James_F, DannyS712, koi, PeterBowman, and zabe: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:29] <zabe>	 hey o/
[20:00:33] <koi>	 hi
[20:00:38] <PeterBowman>	 hello
[20:01:58] <cjming>	 hi - I can deploy
[20:02:23] <cjming>	 James_F: are you around?
[20:03:02] <cjming>	 DannyS712: are you around?
[20:03:49] <icinga-wm>	 RECOVERY - SSH on wtp1046.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:04:21] <wikibugs>	 (PS2) Clare Ming: commonswiki: Enable wgCopyUploadAllowOnWikiDomainConfig [mediawiki-config] - https://gerrit.wikimedia.org/r/793766 (https://phabricator.wikimedia.org/T300407) (owner: Stang)
[20:05:24] <cjming>	 koi: I'll start with your patch since the folks ahead of you haven't responded yet
[20:05:31] <koi>	 ok
[20:06:15] <wikibugs>	 (CR) Clare Ming: [C: +2] commonswiki: Enable wgCopyUploadAllowOnWikiDomainConfig [mediawiki-config] - https://gerrit.wikimedia.org/r/793766 (https://phabricator.wikimedia.org/T300407) (owner: Stang)
[20:06:22] <James_F>	 cjming: Sorry, yes, around.
[20:06:23] <icinga-wm>	 RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:06:35] <James_F>	 I was expecting to deploy myself, but you can go ahead if you wish. :-)
[20:07:04] <cjming>	 James_F: sorry about that! i'll let you self-serve after i get this first patch off - i'll ping you when i'm done
[20:07:12] <James_F>	 Sure!
[20:07:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (3) rsyslog on kubestage1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[20:07:33] <wikibugs>	 (Merged) jenkins-bot: commonswiki: Enable wgCopyUploadAllowOnWikiDomainConfig [mediawiki-config] - https://gerrit.wikimedia.org/r/793766 (https://phabricator.wikimedia.org/T300407) (owner: Stang)
[20:08:31] <cjming>	 koi: can your patch be verified on mwdebug1001?
[20:08:43] <koi>	 I think so, looking
[20:08:44] <cjming>	 1 sec - forgot to rebase
[20:09:00] <cjming>	 koi: ok you can check now
[20:09:56] <James_F>	 cjming: You should deploy everyone else's changes before mine, so I don't hold anyone else up.
[20:10:04] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:10:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:10:12] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db2104.codfw.wmnet with reason: Maintenance
[20:10:13] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2104.codfw.wmnet with reason: Maintenance
[20:10:15] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 20:00:00 on 8 hosts with reason: Maintenance
[20:10:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:10:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:10:21] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on 8 hosts with reason: Maintenance
[20:10:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:10:28] <cjming>	 James_F: alrighty - should be done here quick
[20:10:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:10:41] <wikibugs>	 SRE, Discovery-Search (Current work), Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (bking)
[20:10:56] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:10:58] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:11:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:11:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:11:19] <wikibugs>	 SRE, ops-codfw, DC-Ops, fundraising-tech-ops: Q3:(Need By: TBD) rack/setup/install frbackup2002 - https://phabricator.wikimedia.org/T306842 (Jgreen)
[20:11:22] <cjming>	 koi: gtg?
[20:11:30] <koi>	 still testing
[20:11:54] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:11:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:12:06] <koi>	 cjming, LGTM
[20:12:13] <cjming>	 great - syncing
[20:13:11] <cjming>	 PeterBowman: can you rebase your patch? i tried thru gerrit but it seems to need a manual rebase
[20:13:16] <logmsgbot>	 !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:793766|commonswiki: Enable wgCopyUploadAllowOnWikiDomainConfig (T300407)]] (duration: 00m 52s)
[20:13:19] <PeterBowman>	 sure
[20:13:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:13:21] <stashbot>	 T300407: Allow managing upload-by-url allowlist as a system message - https://phabricator.wikimedia.org/T300407
[20:13:33] <cjming>	 koi: your patch should be live
[20:14:01] <PeterBowman>	 cjming I need some time, can you please continue with other patches in the meantime? I also need to log out
[20:14:15] <cjming>	 PeterBowman: sure - np
[20:14:21] <PeterBowman>	 see you soon
[20:14:29] <cjming>	 Zabe: your next
[20:14:37] <cjming>	 *you're
[20:14:46] <zabe>	 ok
[20:14:51] <wikibugs>	 (CR) Clare Ming: [C: +2] Start writing to cuc_actor in test wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/797312 (https://phabricator.wikimedia.org/T233004) (owner: Zabe)
[20:15:01] <wikibugs>	 (PS2) Clare Ming: Start writing to cuc_actor in test wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/797312 (https://phabricator.wikimedia.org/T233004) (owner: Zabe)
[20:16:32] <zabe>	 cjming, you need to re +2 it, in order to kick the gate-and-submit job again since you rebased it after giving the +2
[20:16:51] <wikibugs>	 (CR) Clare Ming: [C: +2] Start writing to cuc_actor in test wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/797312 (https://phabricator.wikimedia.org/T233004) (owner: Zabe)
[20:16:58] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:17:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:17:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:17:48] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:17:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:17:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:18:09] <wikibugs>	 (Merged) jenkins-bot: Start writing to cuc_actor in test wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/797312 (https://phabricator.wikimedia.org/T233004) (owner: Zabe)
[20:18:39] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:18:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:18:44] <cjming>	 Zabe: is your change testable? on mwdebug1001
[20:19:04] <wikibugs>	 (PS4) Peter Bowman: Add localized wordmark for plwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/789613 (https://phabricator.wikimedia.org/T307683)
[20:20:06] <zabe>	 cjming, lgtm. It's only test wikis so making sure that editing doesn't fatal should be enough.
[20:20:12] <cjming>	 sounds good - syncing
[20:20:39] <wikibugs>	 (PS5) Clare Ming: Add localized wordmark for plwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/789613 (https://phabricator.wikimedia.org/T307683) (owner: Peter Bowman)
[20:21:17] <logmsgbot>	 !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:797312|Start writing to cuc_actor in test wikis (T233004)]] (duration: 00m 50s)
[20:21:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:21:23] <stashbot>	 T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004
[20:21:27] <cjming>	 Zabe: your patch is live
[20:21:35] <zabe>	 thanks :)
[20:21:52] <wikibugs>	 (CR) Clare Ming: [C: +2] Add localized wordmark for plwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/789613 (https://phabricator.wikimedia.org/T307683) (owner: Peter Bowman)
[20:22:01] <logmsgbot>	 !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@aa49833]: increase memory_overhead for convert_to_esbulk
[20:22:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:23:25] <wikibugs>	 (PS3) Zabe: Start writing to cuc_actor in s3, kcgwiki and labtestwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/797294 (https://phabricator.wikimedia.org/T233004)
[20:23:29] <wikibugs>	 (Merged) jenkins-bot: Add localized wordmark for plwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/789613 (https://phabricator.wikimedia.org/T307683) (owner: Peter Bowman)
[20:23:41] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:23:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:24:25] <logmsgbot>	 !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@aa49833]: increase memory_overhead for convert_to_esbulk (duration: 02m 24s)
[20:24:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:24:37] <cjming>	 PeterBowman: can you check mwdebug1001?
[20:24:53] <PeterBowman>	 sorry, first time here, how can I access that?
[20:25:23] <PeterBowman>	 I found instructions to ssh, but this is an interface change
[20:25:55] <James_F>	 PeterBowman: You need to use a browser extension to get your browser to read the production wikis using mwdebug1001 rather than a regular server.
[20:26:02] <James_F>	 PeterBowman: Don't worry about it, I can validate.
[20:26:06] <cjming>	 there's a browser extension WikimediaDebug that allows you to check changes on the server
[20:26:24] <cjming>	 thanks @James_F
[20:26:40] <PeterBowman>	 oops, I'll remember that for the next time :| thank you James_F
[20:26:43] <James_F>	 cjming: And yes, it's working.
[20:26:48] <cjming>	 cool - syncing then
[20:26:52] <James_F>	 PeterBowman: No worries. It's all a bit too complicated, frankly.
[20:27:50] <logmsgbot>	 !log cjming@deploy1002 Synchronized static/images/mobile/copyright/wiktionary-wordmark-pl.svg: Config: [[gerrit:789613|Add localized wordmark for plwiktionary (T307683)]] (duration: 00m 50s)
[20:27:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:27:56] <stashbot>	 T307683: Add localized wordmark for plwiktionary - https://phabricator.wikimedia.org/T307683
[20:27:58] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:27:59] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:28:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:28:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:28:44] <logmsgbot>	 !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:789613|Add localized wordmark for plwiktionary (T307683)]] (duration: 00m 51s)
[20:28:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:28:51] <cjming>	 PeterBowman: James_F: change should be live
[20:28:57] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:29:01] <PeterBowman>	 yes I see it, thank you all! :)
[20:29:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:29:57] <cjming>	 James_F: go ahead with your patches -- can you let me know when you're done?  i have a config change I want to do as well (not quite ready yet)
[20:30:03] <James_F>	 Sure!
[20:30:11] <wikibugs>	 (PS3) Jforrester: Drop CodeReview, Part I: Stop loading it anywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/593350 (https://phabricator.wikimedia.org/T116948)
[20:30:21] <wikibugs>	 (CR) Jforrester: [C: +2] "The time is nigh." [mediawiki-config] - https://gerrit.wikimedia.org/r/593350 (https://phabricator.wikimedia.org/T116948) (owner: Jforrester)
[20:30:32] <cjming>	 DannyS712: if/when you're here, lmk and we can do your patch
[20:31:02] <wikibugs>	 SRE, ops-codfw, DC-Ops, fundraising-tech-ops: Q3:(Need By: TBD) rack/setup/install frbackup2002 - https://phabricator.wikimedia.org/T306842 (Jgreen)
[20:32:05] <wikibugs>	 (Merged) jenkins-bot: Drop CodeReview, Part I: Stop loading it anywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/593350 (https://phabricator.wikimedia.org/T116948) (owner: Jforrester)
[20:32:30] <wikibugs>	 (CR) Eevans: [C: +1] Enable cassandra encryption (aqs cluster) (1 comment) [puppet] - https://gerrit.wikimedia.org/r/791663 (https://phabricator.wikimedia.org/T307798) (owner: Eevans)
[20:33:41] <wikibugs>	 SRE, ops-codfw, DC-Ops, fundraising-tech-ops: Q3:(Need By: TBD) rack/setup/install frbackup2002 - https://phabricator.wikimedia.org/T306842 (Jgreen)
[20:34:00] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:34:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:34:04] <wikibugs>	 SRE, ops-codfw, DC-Ops, fundraising-tech-ops: Q3:(Need By: TBD) rack/setup/install frbackup2002 - https://phabricator.wikimedia.org/T306842 (Jgreen)
[20:34:12] <logmsgbot>	 !log jforrester@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:593350|Drop CodeReview, Part I: Stop loading it anywhere (T116948)]] (duration: 00m 51s)
[20:34:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:34:18] <stashbot>	 T116948: Undeploy CodeReview - https://phabricator.wikimedia.org/T116948
[20:34:41] <wikibugs>	 (PS3) Jforrester: Drop CodeReview, Part II: Stop configuring it anywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/593351 (https://phabricator.wikimedia.org/T116948)
[20:34:45] <wikibugs>	 (CR) Jforrester: [C: +2] Drop CodeReview, Part II: Stop configuring it anywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/593351 (https://phabricator.wikimedia.org/T116948) (owner: Jforrester)
[20:34:58] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:34:59] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:35:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:35:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:35:46] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:35:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:36:13] <wikibugs>	 (Merged) jenkins-bot: Drop CodeReview, Part II: Stop configuring it anywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/593351 (https://phabricator.wikimedia.org/T116948) (owner: Jforrester)
[20:37:04] <wikibugs>	 (PS3) Jforrester: Drop CodeReview, Part III: Drop from i18n build step [mediawiki-config] - https://gerrit.wikimedia.org/r/593352 (https://phabricator.wikimedia.org/T116948)
[20:37:24] <logmsgbot>	 !log jforrester@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:593351|Drop CodeReview, Part II: Stop configuring it anywhere (T116948)]] (duration: 00m 51s)
[20:37:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:37:34] <wikibugs>	 (CR) Jforrester: [C: +2] Drop CodeReview, Part III: Drop from i18n build step [mediawiki-config] - https://gerrit.wikimedia.org/r/593352 (https://phabricator.wikimedia.org/T116948) (owner: Jforrester)
[20:39:05] <wikibugs>	 (Merged) jenkins-bot: Drop CodeReview, Part III: Drop from i18n build step [mediawiki-config] - https://gerrit.wikimedia.org/r/593352 (https://phabricator.wikimedia.org/T116948) (owner: Jforrester)
[20:40:12] <logmsgbot>	 !log jforrester@deploy1002 Synchronized wmf-config/extension-list: Config: [[gerrit:593352|Drop CodeReview, Part III: Drop from i18n build step (T116948)]] (duration: 00m 51s)
[20:40:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:40:17] <stashbot>	 T116948: Undeploy CodeReview - https://phabricator.wikimedia.org/T116948
[20:40:36] <James_F>	 cjming: OK, all done!
[20:40:44] <cjming>	 great - thanks
[20:40:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:40:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:41:00] <wikibugs>	 (CR) Clare Ming: [C: +2] Deploy TOC A/B test to frwiki, ptwiki at 50% [mediawiki-config] - https://gerrit.wikimedia.org/r/797424 (https://phabricator.wikimedia.org/T306607) (owner: Clare Ming)
[20:41:06] <wikibugs>	 (PS3) Clare Ming: Deploy TOC A/B test to frwiki, ptwiki at 50% [mediawiki-config] - https://gerrit.wikimedia.org/r/797424 (https://phabricator.wikimedia.org/T306607)
[20:41:51] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:41:52] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:41:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:41:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:42:50] <wikibugs>	 (CR) Clare Ming: [C: +2] Deploy TOC A/B test to frwiki, ptwiki at 50% [mediawiki-config] - https://gerrit.wikimedia.org/r/797424 (https://phabricator.wikimedia.org/T306607) (owner: Clare Ming)
[20:42:51] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:42:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:43:58] <wikibugs>	 (Merged) jenkins-bot: Deploy TOC A/B test to frwiki, ptwiki at 50% [mediawiki-config] - https://gerrit.wikimedia.org/r/797424 (https://phabricator.wikimedia.org/T306607) (owner: Clare Ming)
[20:44:12] <wikibugs>	 SRE, ops-codfw, DC-Ops, fundraising-tech-ops: Q3:(Need By: TBD) rack/setup/install frbackup2002 - https://phabricator.wikimedia.org/T306842 (Jgreen)
[20:46:06] <logmsgbot>	 !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@2f7ddb1]: increase driver memory_overhead for convert_to_esbulk
[20:46:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:47:31] <logmsgbot>	 !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:797424|Deploy TOC A/B test to frwiki, ptwiki at 50% (T306607)]] (duration: 00m 52s)
[20:47:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:47:35] <stashbot>	 T306607: Deploy ToC A/B test to remainder of desktop improvements pilot wikis - https://phabricator.wikimedia.org/T306607
[20:47:55] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:47:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:48:26] <logmsgbot>	 !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@2f7ddb1]: increase driver memory_overhead for convert_to_esbulk (duration: 02m 20s)
[20:48:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:48:54] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:48:55] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:48:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:49:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:49:03] <cjming>	 !log end of UTC late backport window
[20:49:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:49:55] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:50:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:00:04] <jouncebot>	 Reedy, sbassett, Maryum, and manfredi: Your horoscope predicts another unfortunate Weekly Security deployment window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220523T2100).
[21:00:06] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[21:00:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:02:26] <wikibugs>	 (CR) Dzahn: nrpe: manage sudo rules via nrpe::check (1 comment) [puppet] - https://gerrit.wikimedia.org/r/797422 (owner: Majavah)
[21:03:26] <wikibugs>	 (CR) Yahya: [C: +1] Remove patrol rights from autoconfirmed users and create patroller user group on bnwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/793790 (https://phabricator.wikimedia.org/T308945) (owner: MdsShakil)
[21:03:30] <wikibugs>	 SRE, Infrastructure-Foundations, Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (Ejegg)
[21:03:33] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1182.eqiad.wmnet with reason: Maintenance
[21:03:34] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1182.eqiad.wmnet with reason: Maintenance
[21:03:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:03:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T298555)', diff saved to https://phabricator.wikimedia.org/P28367 and previous config saved to /var/cache/conftool/dbconfig/20220523-210339-ladsgroup.json
[21:03:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:03:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:03:45] <stashbot>	 T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[21:04:10] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[21:04:12] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[21:04:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:04:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:07:53] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[21:07:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:22:59] <icinga-wm>	 PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[21:23:12] <wikibugs>	 (CR) Cwhite: [C: +1] Set fixed uid/gid for kafka by default [puppet] - https://gerrit.wikimedia.org/r/797127 (https://phabricator.wikimedia.org/T296982) (owner: Elukey)
[21:23:48] <TheresNoTime>	 https://phabricator.wikimedia.org 503ing for me
[21:23:49] <Tamzin>	 503
[21:23:54] <Tamzin>	 gah you beat me by like a sec
[21:24:02] <zabe>	 for wikis as well
[21:24:04] <addshore>	 503 Service Unavailable :P
[21:24:08] <wikibugs>	 (CR) Cwhite: [C: +1] alerts: take rule file site into consideration when deploying [puppet] - https://gerrit.wikimedia.org/r/797237 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi)
[21:24:11] <addshore>	 yeah, mw.org is down for me
[21:24:16] <TheresNoTime>	 that's a #page
[21:24:26] <rzl>	 thanks, looking
[21:24:47] <cdanis>	 thanks, looking
[21:24:49] <TheresNoTime>	 it's fine I didn't want to look at phab anyway /s
[21:24:52] <brett>	 phab is up for me (oregon)
[21:24:57] <Vermont>	 :(
[21:25:03] <mutante>	 phab is working. here
[21:25:11] <zabe>	 not for me (europe)
[21:25:14] <wikibugs>	 (CR) Cwhite: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/797201 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi)
[21:25:17] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:25:17] <addshore>	 Not for me UK
[21:25:19] <jinxer-wm>	 (ProbeDown) firing: (9) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:25:19] <jinxer-wm>	 (ProbeDown) firing: (22) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:25:27] <mutante>	 jouncebot: now
[21:25:27] <jouncebot>	 For the next 1 hour(s) and 34 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220523T2100)
[21:25:29] <TheresNoTime>	 jinxer-wm: 2slow4me
[21:25:35] <Tamzin>	 not for me, Eastern U.S. Nor enwiki Main Page
[21:25:37] <mutante>	 is this a deployment ? ^
[21:25:47] <Vermont>	 enwiki is down for me, ticket.wm too
[21:26:01] <icinga-wm>	 PROBLEM - Apache HTTP on mw1368 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[21:26:18] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki appserver at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[21:26:19] <icinga-wm>	 PROBLEM - Apache HTTP on mw1370 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[21:26:19] <icinga-wm>	 PROBLEM - Apache HTTP on mw1364 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[21:26:19] <icinga-wm>	 PROBLEM - Apache HTTP on mw1365 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[21:26:19] <icinga-wm>	 PROBLEM - Apache HTTP on mw1373 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[21:26:21] <icinga-wm>	 PROBLEM - Apache HTTP on mw1372 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[21:26:31] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[21:26:37] <icinga-wm>	 PROBLEM - Apache HTTP on mw1350 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[21:27:11] <icinga-wm>	 PROBLEM - Apache HTTP on mw1349 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[21:27:13] <icinga-wm>	 PROBLEM - Apache HTTP on mw1353 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[21:27:15] <icinga-wm>	 PROBLEM - Apache HTTP on mw1354 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[21:27:17] <icinga-wm>	 PROBLEM - Apache HTTP on mw1366 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[21:27:19] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 18.08 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[21:27:37] <icinga-wm>	 PROBLEM - LVS text-https esams port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 233 bytes in 0.329 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[21:27:40] <icinga-wm>	 PROBLEM - wiki content on commons #page on commons.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - string Picture of the day not found on https://commons.wikimedia.org:443/wiki/Main_Page - 233 bytes in 0.005 second response time https://phabricator.wikimedia.org/project/view/1118/
[21:27:41] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[21:27:42] <icinga-wm>	 PROBLEM - LVS text-https eqsin port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.eqsin.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[21:27:55] <icinga-wm>	 PROBLEM - Apache HTTP on mw1352 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[21:27:55] <icinga-wm>	 PROBLEM - Apache HTTP on mw1355 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[21:27:59] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1017 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp1081.eqiad.wmnet, cp1083.eqiad.wmnet, cp1085.eqiad.wmnet, cp1075.eqiad.wmnet, cp1079.eqiad.wmnet, cp1089.eqiad.wmnet, cp1077.eqiad.wmnet are marked down but pooled: textlb_443: Servers cp1081.eqiad.wmnet, cp1083.eqiad.wmnet, cp1085.eqiad.wmnet, cp1087.eqiad.wmnet, cp1075.eqiad.wmnet, cp1079.eqiad.wmnet, cp1089.eqiad.wmnet, cp107
[21:27:59] <icinga-wm>	 wmnet are marked down but pooled: testlb6_443: Servers cp1081.eqiad.wmnet, cp1083.eqiad.wmnet, cp1087.eqiad.wmnet, cp1079.eqiad.wmnet, cp1089.eqiad.wmnet, cp1077.eqiad.wmnet are marked down but pooled: textlb6_443: Servers cp1083.eqiad.wmnet, cp1085.eqiad.wmnet, cp1075.eqiad.wmnet, cp1079.eqiad.wmnet, cp1089.eqiad.wmnet, cp1077.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[21:28:00] <icinga-wm>	 PROBLEM - https://phabricator.wikimedia.org #page on phabricator.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Phabricator
[21:28:01] <icinga-wm>	 PROBLEM - LVS text-https ulsfo port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[21:28:02] <icinga-wm>	 PROBLEM - Apache HTTP on mw1367 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[21:28:02] <icinga-wm>	 PROBLEM - Apache HTTP on mw1371 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[21:28:02] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs5001 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp5009.eqsin.wmnet, cp5011.eqsin.wmnet, cp5008.eqsin.wmnet, cp5015.eqsin.wmnet, cp5012.eqsin.wmnet, cp5010.eqsin.wmnet, cp5007.eqsin.wmnet are marked down but pooled: textlb_443: Servers cp5009.eqsin.wmnet, cp5011.eqsin.wmnet, cp5016.eqsin.wmnet, cp5008.eqsin.wmnet, cp5015.eqsin.wmnet, cp5012.eqsin.wmnet, cp5007.eqsin.wmnet, cp501
[21:28:02] <icinga-wm>	 wmnet are marked down but pooled: testlb6_443: Servers cp5009.eqsin.wmnet, cp5011.eqsin.wmnet, cp5016.eqsin.wmnet, cp5008.eqsin.wmnet, cp5015.eqsin.wmnet, cp5012.eqsin.wmnet, cp5007.eqsin.wmnet are marked down but pooled: textlb6_443: Servers cp5009.eqsin.wmnet, cp5011.eqsin.wmnet, cp5016.eqsin.wmnet, cp5015.eqsin.wmnet, cp5012.eqsin.wmnet, cp5007.eqsin.wmnet, cp5010.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wi
[21:28:05] <TheresNoTime>	 don't think it's a deployment, but can't check SAL
[21:28:20] <icinga-wm>	 PROBLEM - LVS text-https drmrs port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.drmrs.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[21:28:23] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 6.209 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[21:28:23] <RhinosF1>	 Down here in UK
[21:28:25] <icinga-wm>	 PROBLEM - Apache HTTP on mw1369 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[21:28:30] <icinga-wm>	 PROBLEM - LVS text-https codfw port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.codfw.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[21:28:33] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[21:28:34] <icinga-wm>	 PROBLEM - LVS text-https drmrs port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.drmrs.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[21:28:37] <icinga-wm>	 PROBLEM - Apache HTTP on mw1384 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[21:28:44] <icinga-wm>	 PROBLEM - LVS text-https codfw port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.codfw.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[21:28:58] <icinga-wm>	 PROBLEM - LVS text-https ulsfo port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.ulsfo.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[21:28:58] <icinga-wm>	 PROBLEM - Debmonitor Health Check on debmonitor.wikimedia.org is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable https://wikitech.wikimedia.org/wiki/Debmonitor
[21:28:59] <icinga-wm>	 PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton
[21:29:06] <icinga-wm>	 PROBLEM - LVS text-https eqsin port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.eqsin.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[21:29:09] <icinga-wm>	 PROBLEM - Apache HTTP on mw1351 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[21:29:09] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs3005 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp3060.esams.wmnet, cp3050.esams.wmnet, cp3062.esams.wmnet, cp3064.esams.wmnet, cp3058.esams.wmnet, cp3052.esams.wmnet, cp3056.esams.wmnet are marked down but pooled: textlb_443: Servers cp3060.esams.wmnet, cp3050.esams.wmnet, cp3054.esams.wmnet, cp3062.esams.wmnet, cp3064.esams.wmnet, cp3058.esams.wmnet, cp3052.esams.wmnet, cp305
[21:29:09] <icinga-wm>	 wmnet are marked down but pooled: testlb6_443: Servers cp3060.esams.wmnet, cp3050.esams.wmnet, cp3058.esams.wmnet, cp3062.esams.wmnet, cp3052.esams.wmnet, cp3056.esams.wmnet are marked down but pooled: textlb6_443: Servers cp3060.esams.wmnet, cp3050.esams.wmnet, cp3054.esams.wmnet, cp3062.esams.wmnet, cp3058.esams.wmnet, cp3052.esams.wmnet, cp3056.esams.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[21:29:14] <icinga-wm>	 PROBLEM - LVS text-https eqiad port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[21:29:14] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs5003 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp5009.eqsin.wmnet, cp5012.eqsin.wmnet, cp5007.eqsin.wmnet, cp5011.eqsin.wmnet, cp5008.eqsin.wmnet, cp5015.eqsin.wmnet, cp5010.eqsin.wmnet are marked down but pooled: textlb_443: Servers cp5009.eqsin.wmnet, cp5011.eqsin.wmnet, cp5016.eqsin.wmnet, cp5008.eqsin.wmnet, cp5015.eqsin.wmnet, cp5010.eqsin.wmnet, cp5007.eqsin.wmnet are ma
[21:29:14] <icinga-wm>	 n but pooled: testlb6_443: Servers cp5009.eqsin.wmnet, cp5011.eqsin.wmnet, cp5016.eqsin.wmnet, cp5008.eqsin.wmnet, cp5015.eqsin.wmnet, cp5007.eqsin.wmnet are marked down but pooled: textlb6_443: Servers cp5009.eqsin.wmnet, cp5007.eqsin.wmnet, cp5011.eqsin.wmnet, cp5008.eqsin.wmnet, cp5015.eqsin.wmnet, cp5010.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[21:29:15] <icinga-wm>	 PROBLEM - LVS text-https eqiad port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.eqiad.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 233 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[21:29:22] <icinga-wm>	 PROBLEM - LVS text-https esams port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[21:29:35] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is CRITICAL: 0.9722 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[21:29:37] <icinga-wm>	 PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton
[21:29:43] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1371 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[21:29:43] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw1367.eqiad.wmnet, mw1414.eqiad.wmnet, mw1332.eqiad.wmnet, mw1371.eqiad.wmnet, mw1455.eqiad.wmnet, mw1442.eqiad.wmnet, mw1395.eqiad.wmnet, mw1434.eqiad.wmnet, mw1322.eqiad.wmnet, mw1355.eqiad.wmnet, mw1323.eqiad.wmnet, mw1384.eqiad.wmnet, mw1454.eqiad.wmnet, mw1327.eqiad.wmnet, mw1328.eqiad.wmnet, mw1413.eqiad.wmnet, mw
[21:29:43] <icinga-wm>	 ad.wmnet, mw1393.eqiad.wmnet, mw1351.eqiad.wmnet, mw1416.eqiad.wmnet, mw1405.eqiad.wmnet, mw1352.eqiad.wmnet, mw1432.eqiad.wmnet, mw1441.eqiad.wmnet, mw1333.eqiad.wmnet, mw1326.eqiad.wmnet, mw1435.eqiad.wmnet, mw1420.eqiad.wmnet, mw1418.eqiad.wmnet, mw1319.eqiad.wmnet, mw1407.eqiad.wmnet, mw1366.eqiad.wmnet, mw1324.eqiad.wmnet, mw1372.eqiad.wmnet, mw1391.eqiad.wmnet, mw1370.eqiad.wmnet, mw1429.eqiad.wmnet, mw1451.eqiad.wmnet, mw1331.eqiad
[21:29:43] <icinga-wm>	 mw1401.eqiad.wmnet, mw1397.eqiad.wmnet, mw1325.eqiad.wmnet, mw1403.eqiad.wmnet, mw1373.eqiad.wmnet, mw1385.eqiad.wmnet, mw1369.eqiad.wmnet, mw1419.eqiad.wmnet, mw1387.eqiad.wmnet, mw135 https://wikitech.wikimedia.org/wiki/PyBal
[21:29:47] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw1433.eqiad.wmnet, mw1365.eqiad.wmnet, mw1419.eqiad.wmnet, mw1442.eqiad.wmnet, mw1434.eqiad.wmnet, mw1366.eqiad.wmnet, mw1432.eqiad.wmnet, mw1349.eqiad.wmnet, mw1384.eqiad.wmnet, mw1387.eqiad.wmnet, mw1430.eqiad.wmnet, mw1416.eqiad.wmnet, mw1405.eqiad.wmnet, mw1329.eqiad.wmnet, mw1320.eqiad.wmnet, mw1399.eqiad.wmnet, mw
[21:29:47] <icinga-wm>	 ad.wmnet, mw1420.eqiad.wmnet, mw1333.eqiad.wmnet, mw1393.eqiad.wmnet, mw1454.eqiad.wmnet, mw1372.eqiad.wmnet, mw1370.eqiad.wmnet, mw1389.eqiad.wmnet, mw1395.eqiad.wmnet, mw1397.eqiad.wmnet, mw1325.eqiad.wmnet, mw1385.eqiad.wmnet, mw1417.eqiad.wmnet, mw1367.eqiad.wmnet, mw1373.eqiad.wmnet, mw1455.eqiad.wmnet, mw1436.eqiad.wmnet, mw1332.eqiad.wmnet, mw1452.eqiad.wmnet, mw1414.eqiad.wmnet, mw1369.eqiad.wmnet, mw1371.eqiad.wmnet, mw1453.eqiad
[21:29:47] <icinga-wm>	 mw1322.eqiad.wmnet, mw1319.eqiad.wmnet, mw1323.eqiad.wmnet, mw1327.eqiad.wmnet, mw1413.eqiad.wmnet, mw1456.eqiad.wmnet, mw1351.eqiad.wmnet, mw1391.eqiad.wmnet, mw1352.eqiad.wmnet, mw144 https://wikitech.wikimedia.org/wiki/PyBal
[21:30:07] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1370 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[21:30:12] <icinga-wm>	 RECOVERY - LVS text-https esams port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 18860 bytes in 1.082 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[21:30:13] <icinga-wm>	 RECOVERY - wiki content on commons #page on commons.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 182039 bytes in 0.012 second response time https://phabricator.wikimedia.org/project/view/1118/
[21:30:13] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[21:30:19] <icinga-wm>	 RECOVERY - Apache HTTP on mw1352 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:30:21] <icinga-wm>	 RECOVERY - Apache HTTP on mw1355 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 1.457 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:30:22] <icinga-wm>	 RECOVERY - https://phabricator.wikimedia.org #page on phabricator.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 39622 bytes in 0.133 second response time https://wikitech.wikimedia.org/wiki/Phabricator
[21:30:23] <icinga-wm>	 RECOVERY - LVS text-https ulsfo port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 18873 bytes in 0.589 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[21:30:25] <icinga-wm>	 RECOVERY - Apache HTTP on mw1371 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:30:25] <icinga-wm>	 RECOVERY - Apache HTTP on mw1367 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.125 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:30:33] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1017 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[21:30:44] <icinga-wm>	 RECOVERY - LVS text-https drmrs port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.drmrs.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 18872 bytes in 0.545 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[21:30:49] <icinga-wm>	 RECOVERY - Apache HTTP on mw1369 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:30:51] <icinga-wm>	 RECOVERY - Apache HTTP on mw1368 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:30:52] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[21:30:54] <icinga-wm>	 RECOVERY - LVS text-https codfw port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.codfw.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 18872 bytes in 0.293 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[21:31:00] <jinxer-wm>	 (Wikidata Reliability Metrics - Median Payload alert) firing: Wikidata Reliability Metrics - Median Payload alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+Payload+alert
[21:31:00] <icinga-wm>	 RECOVERY - LVS text-https drmrs port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.drmrs.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 18859 bytes in 0.523 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[21:31:03] <icinga-wm>	 RECOVERY - Apache HTTP on mw1384 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:31:08] <icinga-wm>	 RECOVERY - LVS text-https codfw port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.codfw.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 18860 bytes in 0.310 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[21:31:13] <icinga-wm>	 RECOVERY - Apache HTTP on mw1370 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:31:15] <icinga-wm>	 RECOVERY - Apache HTTP on mw1373 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:31:15] <icinga-wm>	 RECOVERY - Apache HTTP on mw1364 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:31:15] <icinga-wm>	 RECOVERY - Apache HTTP on mw1365 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:31:15] <icinga-wm>	 RECOVERY - Apache HTTP on mw1372 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:31:22] <icinga-wm>	 RECOVERY - LVS text-https ulsfo port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 18860 bytes in 0.472 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[21:31:24] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job swagger_check_restbase_eqiad in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:31:29] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[21:31:29] <icinga-wm>	 RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton
[21:31:33] <icinga-wm>	 RECOVERY - Debmonitor Health Check on debmonitor.wikimedia.org is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 1634 bytes in 0.091 second response time https://wikitech.wikimedia.org/wiki/Debmonitor
[21:31:35] <icinga-wm>	 RECOVERY - Apache HTTP on mw1350 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.056 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:31:35] <icinga-wm>	 RECOVERY - Apache HTTP on mw1351 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.059 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:31:40] <icinga-wm>	 RECOVERY - LVS text-https eqiad port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 18872 bytes in 0.105 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[21:31:45] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs3005 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[21:31:50] <icinga-wm>	 RECOVERY - LVS text-https esams port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 18884 bytes in 0.536 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[21:31:51] <icinga-wm>	 RECOVERY - LVS text-https eqiad port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 18859 bytes in 0.107 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[21:32:07] <icinga-wm>	 RECOVERY - proton LVS eqiad on proton.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton
[21:32:09] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1371 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[21:32:09] <icinga-wm>	 RECOVERY - Apache HTTP on mw1349 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.054 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:32:11] <icinga-wm>	 RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.06944 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[21:32:12] <jinxer-wm>	 (ProbeDown) firing: (23) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:32:13] <icinga-wm>	 RECOVERY - Apache HTTP on mw1353 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:32:13] <addshore>	 back for me
[21:32:13] <icinga-wm>	 RECOVERY - Apache HTTP on mw1354 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:32:15] <icinga-wm>	 RECOVERY - Apache HTTP on mw1366 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[21:32:17] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[21:32:21] <giraffe>	 yep i'm fine
[21:32:21] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[21:32:27] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: (C)60 le (W)70 le 72.07 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[21:32:31] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1370 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[21:33:21] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 95.86 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[21:33:29] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[21:33:51] <icinga-wm>	 PROBLEM - Time elapsed since the last kafka event processed by purged on cp5007 is CRITICAL: 4.451e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5007
[21:33:55] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[21:33:55] <icinga-wm>	 PROBLEM - Time elapsed since the last kafka event processed by purged on cp5011 is CRITICAL: 4.337e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5011
[21:34:05] <icinga-wm>	 PROBLEM - Time elapsed since the last kafka event processed by purged on cp5014 is CRITICAL: 4.6e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5014
[21:34:11] <icinga-wm>	 PROBLEM - Time elapsed since the last kafka event processed by purged on cp5009 is CRITICAL: 4.627e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5009
[21:35:04] <icinga-wm>	 RECOVERY - LVS text-https eqsin port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.eqsin.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 18872 bytes in 6.163 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[21:35:09] <icinga-wm>	 PROBLEM - Time elapsed since the last kafka event processed by purged on cp5008 is CRITICAL: 5.23e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5008
[21:35:11] <icinga-wm>	 PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:35:25] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs5001 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[21:35:47] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) resolved: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[21:36:16] <icinga-wm>	 RECOVERY - LVS text-https eqsin port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.eqsin.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 18859 bytes in 1.306 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[21:36:20] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:36:31] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs5003 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[21:36:57] <jinxer-wm>	 (ProbeDown) firing: (23) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:37:02] <jinxer-wm>	 (ProbeDown) firing: (22) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:37:30] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki appserver at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[21:37:31] <icinga-wm>	 RECOVERY - Time elapsed since the last kafka event processed by purged on cp5008 is OK: (C)5000 gt (W)3000 gt 320.5 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5008
[21:38:31] <icinga-wm>	 RECOVERY - Time elapsed since the last kafka event processed by purged on cp5007 is OK: (C)5000 gt (W)3000 gt 408.9 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5007
[21:38:35] <icinga-wm>	 RECOVERY - Time elapsed since the last kafka event processed by purged on cp5011 is OK: (C)5000 gt (W)3000 gt 386.7 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5011
[21:38:43] <icinga-wm>	 RECOVERY - Time elapsed since the last kafka event processed by purged on cp5014 is OK: (C)5000 gt (W)3000 gt 293.7 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5014
[21:38:49] <icinga-wm>	 RECOVERY - Time elapsed since the last kafka event processed by purged on cp5009 is OK: (C)5000 gt (W)3000 gt 379.3 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5009
[21:38:55] <icinga-wm>	 PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:39:45] <jinxer-wm>	 (JobUnavailable) resolved: (4) Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:40:19] <jinxer-wm>	 (ProbeDown) resolved: (23) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:40:19] <jinxer-wm>	 (ProbeDown) resolved: (21) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:40:36] <Amir1>	 what just happened?
[21:40:39] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host relforge1003.eqiad.wmnet with OS bullseye
[21:40:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:41:10] <addshore>	 Amir1: a few minutes of downtime :P
[21:41:50] <Amir1>	 addshore: according to alert it was only one minute :D
[21:42:05] <perryprog>	 Interesting, looks like that makes https://www.wikimediastatus.net automatically add an "Errors for many users" incident
[21:43:14] <zabe>	 it was definitely more than a minute
[21:43:18] <mutante>	 !log [cumin1001:~] $ sudo systemctl start httpbb_hourly_appserver
[21:43:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:43:28] <mutante>	 zabe: about 8 or so
[21:44:25] <TheresNoTime>	 mutante: so it was `PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service` ?
[21:45:31] * addshore goes back to what he was doing
[21:45:43] <mutante>	 TheresNoTime: that is failing for an unrelated reason
[21:45:51] <TheresNoTime>	 ah (: 
[21:46:02] * TheresNoTime has forgotten what they were doing now
[21:46:23] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9400 on relforge1004 is CRITICAL: CRITICAL - elasticsearch http://localhost:9400/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9400): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:46:29] <mutante>	 TheresNoTime: it's because https://www.mediawiki.org/w/index.php?title=Special:CodeReview&path=foo is 404 and not 302 (those redirects for CodeReview)
[21:47:57] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on relforge1004 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:48:03] <icinga-wm>	 ACKNOWLEDGEMENT - ElasticSearch health check for shards on 9200 on relforge1004 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) Ryan Kemper T308770 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:48:03] <icinga-wm>	 ACKNOWLEDGEMENT - ElasticSearch health check for shards on 9400 on relforge1004 is CRITICAL: CRITICAL - elasticsearch http://localhost:9400/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9400): Read timed out. (read timeout=4) Ryan Kemper T308770 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:48:04] <TheresNoTime>	 as in https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/593352/3/wmf-config/extension-list ?
[21:48:11] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] logstash: set dlq output and template_version [puppet] - 10https://gerrit.wikimedia.org/r/777888 (https://phabricator.wikimedia.org/T305088) (owner: 10Cwhite)
[21:49:32] <jinxer-wm>	 (Wikidata Reliability Metrics - Median Payload alert) resolved: Wikidata Reliability Metrics - Median Payload alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+Payload+alert
[21:50:46] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on relforge1003.eqiad.wmnet with reason: host reimage
[21:50:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:52:53] <mutante>	 TheresNoTime: sounds likely. but whatever it says on https://phabricator.wikimedia.org/T205361  afaict
[21:54:35] <mutante>	 legoktm: is it expected that Special:Code is gone?
[21:54:40] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on relforge1003.eqiad.wmnet with reason: host reimage
[21:54:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:54:57] <legoktm>	 mutante: yes, James_F undeployed it earlier. Some but not all of the redirects are in place
[21:55:29] <legoktm>	 e.g. https://gerrit.wikimedia.org/r/c/operations/puppet/+/774943/
[21:55:30] <mutante>	 legoktm: it triggered alerts because we did not remove the tests before actually undeploying it. I will fix that now though
[21:55:34] <perryprog>	 it does redirect on mediawiki, but not e.g., enwiki
[21:55:37] <legoktm>	 thanks!
[21:55:50] <legoktm>	 perryprog: Special:Code never existed on any other wiki besides mw.o
[21:55:58] <perryprog>	 🤦‍♂️ ah
[21:56:11] <legoktm>	 mutante: you can tag any patches with T116948
[21:56:11] <stashbot>	 T116948: Undeploy CodeReview - https://phabricator.wikimedia.org/T116948
[21:56:51] <mutante>	 one rule is about Special:Code  but others are about Special:CodeReview
[22:01:54] <wikibugs>	 (03PS1) 10Dzahn: httpbb: remove tests for undeployed CodeReview extension [puppet] - 10https://gerrit.wikimedia.org/r/797533 (https://phabricator.wikimedia.org/T116948)
[22:02:32] <Bsadowski1>	 I can access English atm
[22:05:42] <zabe>	 The redirects from mw.o/Special:Code to static-codereview.wikimedia.org should still work, so when the tests are alerting that means that the tests are not completely correct or that they depend on https://gerrit.wikimedia.org/r/c/operations/puppet/+/774943 or some other fix
[22:07:55] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9400 on relforge1004 is OK: OK - elasticsearch status relforge-eqiad-small-alpha: relocating_shards: 0, status: green, number_of_nodes: 2, cluster_name: relforge-eqiad-small-alpha, delayed_unassigned_shards: 0, initializing_shards: 0, timed_out: False, active_shards_percent_as_number: 100.0, unassigned_shards: 0, active_primary_shards: 37, task_max_waiting_in_queue_millis: 0, number_of_pending_tasks: 0, active_shards: 42, number_of_in_flight_fetch: 0, number_of_data_nodes: 2 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:10:33] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[22:10:39] <mutante>	 zabe: 2 of them failed. and they are:
[22:10:56] <mutante>	 Status code: expected 302, got 404.  - https://www.mediawiki.org/w/index.php?title=Special:CodeReview&path=foo
[22:11:17] <mutante>	 Status code: expected 302, got 404.  - https://www.mediawiki.org/w/index.php?title=Special:Code&path=foo
[22:11:43] <mutante>	 zabe: amending https://gerrit.wikimedia.org/r/c/operations/puppet/+/797533/1/modules/profile/files/httpbb/appserver/test_main.yaml
[22:12:45] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host relforge1003.eqiad.wmnet with OS bullseye
[22:12:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:13:46] <mutante>	 zabe: or not. just adding reviewers.. heh
[22:14:31] <wikibugs>	 (03CR) 10Jcrespo: [C: 04-1] "I belive the static version stays, only the mw extension has to be removed?- but someone else here should confirm." [puppet] - 10https://gerrit.wikimedia.org/r/797533 (https://phabricator.wikimedia.org/T116948) (owner: 10Dzahn)
[22:16:25] <zabe>	 what would you say about only removing the two failing ones for now? It seems like the rewrite rules need some tweaking in order to work for cases as well.
[22:16:28] <wikibugs>	 (03CR) 10Jcrespo: [C: 04-1] httpbb: remove tests for undeployed CodeReview extension (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/797533 (https://phabricator.wikimedia.org/T116948) (owner: 10Dzahn)
[22:17:41] <perryprog>	 BTW, what was the cause of the earlier outage, since the httpbb failures were unrelated to that? I was looking for follow-up on it but didn't see any.
[22:19:16] <wikibugs>	 (03PS4) 10Zabe: Start writing to cuc_actor in s3, kcgwiki and labtestwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/797294 (https://phabricator.wikimedia.org/T233004)
[22:24:54] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:25:50] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.282 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:35:44] <wikibugs>	 (03PS1) 10Jdlrobson: mediawiki.skinning: `transition-duration` accessibility override set to `0` [core] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/797219 (https://phabricator.wikimedia.org/T308979)
[22:37:30] <wikibugs>	 10SRE, 10DBA, 10Platform Engineering, 10Sustainability (Incident Followup): Improve slow read query handling - https://phabricator.wikimedia.org/T293530 (10aaron) Note that MYSQLI_OPT_READ_TIMEOUT can only be set once per https://bugs.php.net/bug.php?id=76703
[22:39:30] <icinga-wm>	 RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:41:12] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1179.eqiad.wmnet with reason: Maintenance
[22:41:14] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1179.eqiad.wmnet with reason: Maintenance
[22:41:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:41:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T298555)', diff saved to https://phabricator.wikimedia.org/P28368 and previous config saved to /var/cache/conftool/dbconfig/20220523-224119-ladsgroup.json
[22:41:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:41:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:41:27] <stashbot>	 T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[22:53:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (5) rsyslog on kubernetes1014:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[23:06:42] <wikibugs>	 (03CR) 10Jforrester: "Hmm. These were meant to have been adjusted so they wouldn't alert when the extension was undeployed, because they were asserting that the" [puppet] - 10https://gerrit.wikimedia.org/r/797533 (https://phabricator.wikimedia.org/T116948) (owner: 10Dzahn)
[23:08:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T298555)', diff saved to https://phabricator.wikimedia.org/P28369 and previous config saved to /var/cache/conftool/dbconfig/20220523-230851-ladsgroup.json
[23:08:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:08:58] <stashbot>	 T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[23:15:53] <wikibugs>	 (03PS2) 10Dzahn: httpbb: remove tests for undeployed CodeReview extension [puppet] - 10https://gerrit.wikimedia.org/r/797533 (https://phabricator.wikimedia.org/T116948)
[23:16:27] <wikibugs>	 (03CR) 10Dzahn: "amended. now only removing what _actually_ fails currently. that was line 78 and line 82." [puppet] - 10https://gerrit.wikimedia.org/r/797533 (https://phabricator.wikimedia.org/T116948) (owner: 10Dzahn)
[23:16:38] <wikibugs>	 (03CR) 10Jforrester: [C: 03+1] httpbb: remove tests for undeployed CodeReview extension [puppet] - 10https://gerrit.wikimedia.org/r/797533 (https://phabricator.wikimedia.org/T116948) (owner: 10Dzahn)
[23:17:04] <wikibugs>	 (03CR) 10Dzahn: httpbb: remove tests for undeployed CodeReview extension (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/797533 (https://phabricator.wikimedia.org/T116948) (owner: 10Dzahn)
[23:17:24] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] httpbb: remove tests for undeployed CodeReview extension [puppet] - 10https://gerrit.wikimedia.org/r/797533 (https://phabricator.wikimedia.org/T116948) (owner: 10Dzahn)
[23:18:36] <wikibugs>	 (03CR) 10Dzahn: [V: 03+2 C: 03+2] httpbb: remove tests for undeployed CodeReview extension [puppet] - 10https://gerrit.wikimedia.org/r/797533 (https://phabricator.wikimedia.org/T116948) (owner: 10Dzahn)
[23:20:00] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:20:46] <mutante>	 !log cumin1001 - systemctl start httpbb_hourly_appserver after deploying gerrit:797533 leads to '+icinga-wm> RECOVERY - Check systemd state on cumin1001 is OK: OK'  T116948
[23:20:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:20:58] <stashbot>	 T116948: Undeploy CodeReview - https://phabricator.wikimedia.org/T116948
[23:21:19] <wikibugs>	 (03CR) 10Dzahn: [V: 03+2 C: 03+2] "manually started: <+icinga-wm> RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wi" [puppet] - 10https://gerrit.wikimedia.org/r/797533 (https://phabricator.wikimedia.org/T116948) (owner: 10Dzahn)
[23:22:40] <icinga-wm>	 RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[23:23:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P28370 and previous config saved to /var/cache/conftool/dbconfig/20220523-232357-ladsgroup.json
[23:24:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:31:43] <wikibugs>	 (03CR) 10Krinkle: Add "db-mainstash" entry to $wgObjectCaches (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752807 (https://phabricator.wikimedia.org/T212129) (owner: 10Aaron Schulz)
[23:32:06] <icinga-wm>	 RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:39:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P28371 and previous config saved to /var/cache/conftool/dbconfig/20220523-233902-ladsgroup.json
[23:39:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:44:31] <wikibugs>	 (03CR) 10Krinkle: [C: 04-1] "blocked on https://gerrit.wikimedia.org/r/c/mediawiki/core/+/789725 as indeed currently a missing localLB is "fixed" by service wiring via" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752807 (https://phabricator.wikimedia.org/T212129) (owner: 10Aaron Schulz)
[23:47:38] <logmsgbot>	 !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@02f2375]: increase driver jvm heap for convert_to_esbulk
[23:47:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:49:56] <logmsgbot>	 !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@02f2375]: increase driver jvm heap for convert_to_esbulk (duration: 02m 18s)
[23:49:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:54:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T298555)', diff saved to https://phabricator.wikimedia.org/P28372 and previous config saved to /var/cache/conftool/dbconfig/20220523-235407-ladsgroup.json
[23:54:08] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1129.eqiad.wmnet with reason: Maintenance
[23:54:10] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1129.eqiad.wmnet with reason: Maintenance
[23:54:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:54:13] <stashbot>	 T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[23:54:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T298555)', diff saved to https://phabricator.wikimedia.org/P28373 and previous config saved to /var/cache/conftool/dbconfig/20220523-235415-ladsgroup.json
[23:54:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:54:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:54:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:56:11] <wikibugs>	 10SRE, 10Traffic, 10Wikimedia-Incident: All wikis down: error 503 (resolved, follow-up pending) - https://phabricator.wikimedia.org/T308940 (10Dzahn) 05Open→03Resolved a:03Dzahn https://wikitech.wikimedia.org/wiki/Incidents/2022-05-21_-_varnish_cache_busting
[23:56:22] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on relforge1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 96 threshold =0.15 breach: cluster_name: relforge-eqiad, status: yellow, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 170, active_shards: 211, relocating_shards: 0, initializing_shards: 2, unassigned_shards: 94, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 68.72964169381108 https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:59:16] <koi>	 I repeatedly receive email notifications from Gerrit (V+2, CR+2 etc.) about an already merged patch.. is this some problem on my side?
[23:59:29] <koi>	 *several