[00:12:03] PROBLEM - Check systemd state on an-master1002 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-namenode-backup-fetchimage.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:46:17] PROBLEM - Disk space on gitlab1001 is CRITICAL: DISK CRITICAL - free space: /mnt/gitlab-backup 0 MB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=gitlab1001&var-datasource=eqiad+prometheus/ops [00:53:27] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:17:21] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:18:17] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:26:43] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:40:45] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:50:45] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:19:27] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:36:23] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:36:47] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:38:27] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.310 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:38:51] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48107 bytes in 0.150 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:18:05] PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:39:59] !log marostegui@cumin1001 START - Cookbook sre.hosts.ipmi-password-reset [04:39:59] !log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.ipmi-password-reset (exit_code=99) [04:40:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:40:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:40:19] !log marostegui@cumin1001 START - Cookbook sre.hosts.ipmi-password-reset [04:40:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:40:33] !log marostegui@cumin1001 Updating IPMI password on 1 hosts - marostegui@cumin1001 [04:40:35] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.ipmi-password-reset (exit_code=0) [04:40:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:40:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:51:02] akosiaris: ^ enjoy [04:53:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1143.eqiad.wmnet with 
reason: Maintenance [04:53:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1143.eqiad.wmnet with reason: Maintenance [04:54:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:54:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1143 (T298555)', diff saved to https://phabricator.wikimedia.org/P28280 and previous config saved to /var/cache/conftool/dbconfig/20220523-045404-ladsgroup.json [04:54:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:54:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:54:12] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [04:55:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1170.eqiad.wmnet with reason: Maintenance [04:55:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1170.eqiad.wmnet with reason: Maintenance [04:55:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:55:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T298555)', diff saved to https://phabricator.wikimedia.org/P28281 and previous config saved to /var/cache/conftool/dbconfig/20220523-045548-ladsgroup.json [04:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:55:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:58:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1118 reimage to bulseye', diff saved to https://phabricator.wikimedia.org/P28282 and previous config saved to /var/cache/conftool/dbconfig/20220523-045850-marostegui.json [04:58:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:03:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on 
db1131.eqiad.wmnet with reason: Maintenance [05:03:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance [05:03:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:03:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1131 (T303603)', diff saved to https://phabricator.wikimedia.org/P28283 and previous config saved to /var/cache/conftool/dbconfig/20220523-050341-ladsgroup.json [05:03:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:03:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:03:46] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [05:06:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T303603)', diff saved to https://phabricator.wikimedia.org/P28284 and previous config saved to /var/cache/conftool/dbconfig/20220523-050624-ladsgroup.json [05:06:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:06:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1118.eqiad.wmnet with OS bullseye [05:06:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:09:27] * kart_ updating cxserver [05:12:29] PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:15:17] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1118.eqiad.wmnet with reason: host reimage [05:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:18:02] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1118.eqiad.wmnet with reason: host reimage [05:18:06] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:20:58] jouncebot: nowandnext [05:20:58] For the next 1 hour(s) and 39 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220522T0700) [05:20:59] In 1 hour(s) and 39 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220523T0700) [05:21:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P28285 and previous config saved to /var/cache/conftool/dbconfig/20220523-052130-ladsgroup.json [05:21:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:24:08] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [05:24:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:24:43] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [05:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:25:38] !log taavi@deploy1002 Synchronized php-1.39.0-wmf.12/extensions/WikimediaMaintenance/fixT308895BrokenRenames.php: Backport: [[gerrit:793800|Add a script to fix T308895 renames (T308895)]] (duration: 00m 51s) [05:25:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:25:43] T308895: GlobalRename not renaming some accounts - https://phabricator.wikimedia.org/T308895 [05:26:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [05:26:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:27:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [05:27:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [05:27:51] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:27:53] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply [05:27:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:27:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:28:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [05:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:28:52] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [05:28:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:30:49] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [05:30:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:31:45] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [05:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:33:01] !log Updated cxserver to 2022-05-22-062659-production (T290847) [05:33:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:33:06] T290847: Generate template parameter alignments for languages of interest to Section Translation - https://phabricator.wikimedia.org/T290847 [05:35:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1118.eqiad.wmnet with OS bullseye [05:35:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:35:43] RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:36:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P28286 and previous config saved to /var/cache/conftool/dbconfig/20220523-053635-ladsgroup.json [05:36:39] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:43:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 1%: After reimage', diff saved to https://phabricator.wikimedia.org/P28287 and previous config saved to /var/cache/conftool/dbconfig/20220523-054311-root.json [05:43:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:51:00] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:51:02] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1172.eqiad.wmnet with OS bullseye [05:51:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:51:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T303603)', diff saved to https://phabricator.wikimedia.org/P28288 and previous config saved to /var/cache/conftool/dbconfig/20220523-055140-ladsgroup.json [05:51:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:51:45] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [05:53:54] RECOVERY - MegaRAID on an-worker1081 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [05:57:10] jouncebot: nowandnext [05:57:10] For the next 1 hour(s) and 2 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220522T0700) [05:57:10] In 1 hour(s) and 2 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220523T0700) [05:57:26] cool. 
Going to deploy stuff [05:58:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 5%: After reimage', diff saved to https://phabricator.wikimedia.org/P28289 and previous config saved to /var/cache/conftool/dbconfig/20220523-055815-root.json [05:58:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:20] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:788385|TimedMediaHandler: Disabled the BetaFeature from wikis (T248418)]] (duration: 00m 51s) [06:00:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:26] T248418: Roll out videojs as the only video/audio player on all Wikimedia wikis - https://phabricator.wikimedia.org/T248418 [06:00:36] good morning Amir1, would you mind pinging me when you're done? [06:00:46] urbanecm: good morning, sure [06:01:23] thanks [06:02:02] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:612350|TimedMediaHandler: Drop Beta Feature, no longer usable (T248418)]] (duration: 00m 52s) [06:02:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1172.eqiad.wmnet with reason: host reimage [06:02:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:02:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:03:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [06:04:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:04:13] !log ladsgroup@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:612351|TimedMediaHandler: Don't read wmgTmhWebPlayer (T248418)]] (duration: 00m 50s) [06:04:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:04:57] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1172.eqiad.wmnet with reason: host 
reimage [06:04:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [06:04:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [06:05:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:05:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:05:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:06:59] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:612352|TimedMediaHandler: Drop pre-switch config, no longer read (T248418)]] (duration: 00m 54s) [06:07:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:07:04] T248418: Roll out videojs as the only video/audio player on all Wikimedia wikis - https://phabricator.wikimedia.org/T248418 [06:07:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [06:07:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:10:43] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:793763|Turn on WRITE BOTH for templatelink migration in enwiki (T299421)]] (duration: 00m 51s) [06:10:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:10:49] T299421: Turn on write both in production for templatelinks normalization - https://phabricator.wikimedia.org/T299421 [06:12:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [06:12:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:12:41] RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:13:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 10%: After reimage', diff saved to https://phabricator.wikimedia.org/P28290 and 
previous config saved to /var/cache/conftool/dbconfig/20220523-061319-root.json [06:13:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:13:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [06:13:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [06:13:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:13:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:13:49] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [06:14:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [06:14:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:18:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance elastic1076-production-search-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [06:22:27] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1172.eqiad.wmnet with OS bullseye [06:22:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:28:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 25%: After reimage', diff saved to https://phabricator.wikimedia.org/P28291 and previous config saved to /var/cache/conftool/dbconfig/20220523-062822-root.json [06:28:25] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:34:46] !log urbanecm@mwmaint1002:~$ foreachwikiindblist growthexperiments extensions/GrowthExperiments/maintenance/migrateMenteeOverviewFiltersToPresets.php --update # T304057 [06:34:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:34:51] T304057: Migrate growthexperiments-mentee-overview-filters to growthexperiments-mentee-overview-presets - https://phabricator.wikimedia.org/T304057 [06:35:22] !log ladsgroup@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:791605|Remove unused OggThumbLocation config variable (T308191)]] (duration: 00m 51s) [06:35:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:35:26] T308191: Remove oggThumb from TMH - https://phabricator.wikimedia.org/T308191 [06:36:33] urbanecm: I'm done finally [06:36:44] thanks [06:37:53] mine should be fairly quick [06:38:40] !log urbanecm@deploy1002 Synchronized private/PrivateSettings.php: Update T250887 mitigations (duration: 00m 52s) [06:38:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:39:32] * urbanecm done [06:39:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [06:39:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [06:40:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [06:40:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:43:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 50%: After reimage', diff saved to https://phabricator.wikimedia.org/P28292 and previous config saved to /var/cache/conftool/dbconfig/20220523-064326-root.json [06:43:30] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:44:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [06:44:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:48:01] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [06:48:49] RECOVERY - Check that envoy is running on idp-test2002 is OK: OK - envoyproxy.service is active https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [06:49:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [06:49:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:50:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [06:50:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [06:50:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:50:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:51:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [06:51:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:58:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 75%: After reimage', diff saved to https://phabricator.wikimedia.org/P28293 and previous config saved to /var/cache/conftool/dbconfig/20220523-065830-root.json [06:58:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:58:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1123.eqiad.wmnet with reason: 
Maintenance [06:58:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance [06:58:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:58:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:04] Amir1 and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220523T0700). [07:00:04] kart_ and DannyS712: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:52] * kart_ is here [07:01:17] * DannyS712 is here [07:01:42] I'll start with +2 on the wmf.12 patch; CI will take a few minutes, so meanwhile I'll deploy the config patch. [07:02:17] can I add a 5th patch (makes it 7 total) - https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/796353 ? [07:02:37] (all of my patches are phpcs cleanup and shouldn't actually change anything) [07:02:50] DannyS712: sure. Go ahead. [07:03:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T298555)', diff saved to https://phabricator.wikimedia.org/P28294 and previous config saved to /var/cache/conftool/dbconfig/20220523-070314-ladsgroup.json [07:03:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:03:20] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [07:06:26] good morning [07:06:29] Looks like the mw-config patch-merged notifications no longer appear here? 
[07:08:48] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Razzi out of all services on: 562 hosts [07:08:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:09:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Razzi out of all services on: 562 hosts [07:09:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:09:18] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Razzi out of all services on: 1227 hosts [07:09:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:09:34] Deploying config patch.. [07:09:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Razzi out of all services on: 1227 hosts [07:09:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:09:57] !log kartik@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:793444|Enable ContentTranslation as default for cs, el, he, ko and tr WPs (T298239 T304853 T304854 T304855 T304863)]] (duration: 00m 50s) [07:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:06] T304854: Enable Content and Section Translation for Greek Wikipedia - https://phabricator.wikimedia.org/T304854 [07:10:07] T304855: Enable Content and Section Translation for Czech Wikipedia - https://phabricator.wikimedia.org/T304855 [07:10:08] T304863: Enable Content and Section Translation for Hebrew Wikipedia - https://phabricator.wikimedia.org/T304863 [07:10:08] T304853: Enable Content and Section Translation for Turkish Wikipedia - https://phabricator.wikimedia.org/T304853 [07:10:08] T298239: Enable Content and Section Translation for Korean Wikipedia - https://phabricator.wikimedia.org/T298239 [07:11:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:11:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:06] 
DannyS712: Waiting for CI for wmf.12 patch now.. [07:12:33] Seems 8 minutes.. [07:13:01] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance elastic1076-production-search-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [07:13:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 100%: After reimage', diff saved to https://phabricator.wikimedia.org/P28295 and previous config saved to /var/cache/conftool/dbconfig/20220523-071334-root.json [07:13:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:14:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:14:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:14:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:15:43] kart_ okay. Can I add a 6th patch for me / 8th overall? 
https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/796354 [07:16:37] (I know normally it's a max of 6 patches but since these are no-ops I thought it might be okay) [07:17:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:17:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T298555)', diff saved to https://phabricator.wikimedia.org/P28296 and previous config saved to /var/cache/conftool/dbconfig/20220523-071728-ladsgroup.json [07:17:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:31] DannyS712: As long as they fit into the window and are no-ops :) [07:17:32] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [07:18:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P28297 and previous config saved to /var/cache/conftool/dbconfig/20220523-071819-ladsgroup.json [07:18:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:39] okay, then I'll keep adding patches and we'll see what we get to. 
I think your wmf.12 patch merged [07:22:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:22:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:23:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:23:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:24] Testing my patch on mwdebug1001 [07:24:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:24:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:24:59] Deploying now.. [07:25:40] !log kartik@deploy1002 Synchronized php-1.39.0-wmf.12/extensions/ContentTranslation/modules/base/mw.cx.SiteMapper.js: Backport: [[gerrit:796351|Sitemapper: Fix the configuration override (T308802)]] (duration: 00m 51s) [07:25:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:25:46] T308802: Content Translation redirects back to "Start translation" when dashboard is loaded from contribution menu - https://phabricator.wikimedia.org/T308802 [07:26:24] DannyS712: I'm done. [07:27:34] okay. Just realized - you were self-deploying your patches, but I can't do that for my own patches because I don't have deployment rights [07:28:04] Oh, I thought you were doing it yourself :/ [07:28:14] Is urbanecm around? [07:28:40] Yes. What's up? [07:28:58] urbanecm: DannyS712's patches need help. [07:29:14] help = deployment [07:29:35] Oh yeah. [07:29:46] I need to go to lunch in a few minutes. 
[07:30:03] well, let's have a look then [07:30:04] jouncebot: now [07:30:04] For the next 0 hour(s) and 29 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220523T0700) [07:30:16] all no-op phpcs cleanup [07:30:36] DannyS712: all patches have a CI failure in one of the checks? [07:31:03] kart_: that's because they're no-ops. [07:31:30] operations-mw-config-php72-composer-diffConfig-docker expects a config patch to change the resulting config, which is a reasonable assumption, but in this case it's okay that there is no change :) [07:31:34] where is wikibugs btw? [07:31:34] Haven't looked at the code, sorry :/ [07:31:38] no problem :) [07:32:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P28298 and previous config saved to /var/cache/conftool/dbconfig/20220523-073233-ladsgroup.json [07:32:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P28299 and previous config saved to /var/cache/conftool/dbconfig/20220523-073324-ladsgroup.json [07:33:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:34:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:33] PROBLEM - Memcached on idp-test1002 is CRITICAL: connect to address 208.80.154.72 and port 11000: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [07:35:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:35:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:35:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:27] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:45] urbanecm I have more than the 4 patches that would meet the 6 maximum normally imposed for backport windows, but since these are all no-ops would you be willing to deploy more than the 4? [07:36:02] DannyS712: i'm reviewing them all, we should be able to do them [07:36:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:36:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:25] it's easier to deploy obvious no-ops like those patches :) [07:38:01] !log urbanecm@deploy1002 Synchronized private/readme.php: 7a8d8a06: phpcs: move DisallowYodaConditions exclusion inline (duration: 00m 49s) [07:38:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:15] !log urbanecm@deploy1002 Synchronized multiversion/MWConfigCacheGenerator.php: e6fb9266: phpcs: enable FunctionComment.MissingDocumentationPrivate (duration: 01m 30s) [07:40:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:41:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:44] !log urbanecm@deploy1002 Synchronized multiversion/MWConfigCacheGenerator.php: 8f8b04e0: phpcs: enable PropertyDocumentation.WrongStyle (duration: 00m 49s) [07:41:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:42:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:42:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:33] there are a few later patches that are not on the wikitech page but still in the 
same relation chain, I'll update wikitech with what is actually getting deployed at the end [07:42:35] !log urbanecm@deploy1002 Synchronized w/fatal-error.php: 8f8b04e0: phpcs: enable PropertyDocumentation.WrongStyle (duration: 00m 50s) [07:42:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:43:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:43] !log urbanecm@deploy1002 Synchronized w/fatal-error.php: 7c28808: phpcs: enable and suppress DuplicateClassName.Found (duration: 00m 48s) [07:43:45] DannyS712: does that mean you want me to review&deploy more patches than what's on the wikitech page? [07:43:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:09] urbanecm if you're willing, yes [07:44:25] DannyS712: in that case, please list them in the calendar :) [07:44:44] okay, it's https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/796356/2, the follow-up to that, and 1 more I'll create in a second [07:46:31] third one is https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/796358 [07:47:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P28300 and previous config saved to /var/cache/conftool/dbconfig/20220523-074739-ladsgroup.json [07:47:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:19] !log urbanecm@deploy1002 Synchronized src/: 0e012139: phpcs: enable PropertyDocumentation.MissingDocumentationPrivate (duration: 00m 50s) [07:48:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:48:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T298555)', diff
saved to https://phabricator.wikimedia.org/P28301 and previous config saved to /var/cache/conftool/dbconfig/20220523-074829-ladsgroup.json [07:48:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1146.eqiad.wmnet with reason: Maintenance [07:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1146.eqiad.wmnet with reason: Maintenance [07:48:33] also 4th https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/796359 [07:48:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:37] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [07:48:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T298555)', diff saved to https://phabricator.wikimedia.org/P28302 and previous config saved to /var/cache/conftool/dbconfig/20220523-074837-ladsgroup.json [07:48:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:09] !log urbanecm@deploy1002 Synchronized w/fatal-error.php: 0e012139: phpcs: enable PropertyDocumentation.MissingDocumentationPrivate (duration: 00m 49s) [07:49:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:49:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:49:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:31] urbanecm: wikibugs is https://phabricator.wikimedia.org/T308995 [07:49:34] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:53] deployment calendar updated [07:49:55] !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host htmldumper1001.eqiad.wmnet [07:49:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:00] !log urbanecm@deploy1002 Synchronized multiversion/MWConfigCacheGenerator.php: 0e012139: phpcs: enable PropertyDocumentation.MissingDocumentationPrivate (duration: 00m 50s) [07:50:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:50:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:34] !log urbanecm@deploy1002 Synchronized w/fatal-error.php: a888904: phpcs: enable and suppress ClassMatchesFilename.NotMatch (duration: 00m 49s) [07:51:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:24] !log urbanecm@deploy1002 Synchronized wmf-config/CommonSettings.php: a888904: phpcs: enable and suppress ClassMatchesFilename.NotMatch (duration: 00m 49s) [07:52:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:25] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host htmldumper1001.eqiad.wmnet [07:53:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:34] !log urbanecm@deploy1002 Synchronized wmf-config/CommonSettings.php: 86d08457: phpcs: move ForbiddenFunctions.extract exclusion inline (duration: 00m 50s) [07:53:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:21] !log urbanecm@deploy1002 Synchronized docroot/noc/conf/activeMWVersions.php: e1df8fabc: phpcs: move ForbiddenFunctions.exec exclusion inline (duration: 00m 50s) [07:55:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:26] !log mwdebug-deploy@deploy1002 
helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:55:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:04] DannyS712: and that should be it :) [07:56:25] do you have time for one more? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/796360/4 [07:56:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:56:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:56:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:23] DannyS712: we've less than three minutes, that's not enough unfortunately. [07:57:32] okay, then next time [07:57:36] yup :) [07:57:51] still, I got 11 patches merged in record time [07:58:22] can I add this one to the UTC late backport window today that you are deploying? I might not be around then but it should still be a no-op [07:59:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:59:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:09] DannyS712: i can't guarantee it'll be me actually doing the deployment though [07:59:32] okay, I'll list it there and hope for the best, I might be able to make it [08:00:27] thanks for reviewing and deploying! 
:) [08:02:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T298555)', diff saved to https://phabricator.wikimedia.org/P28303 and previous config saved to /var/cache/conftool/dbconfig/20220523-080244-ladsgroup.json [08:02:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:50] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [08:04:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:04:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:05:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:05:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:06:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:03] !log fixing renames of 44 accounts T308895 [08:12:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:10] T308895: GlobalRename not renaming some accounts - https://phabricator.wikimedia.org/T308895 [08:14:03] RECOVERY - Check systemd state on dumpsdata1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:14:29] can confirm that there are rename logs showing up on enwikiquote [08:19:51] PROBLEM - Check systemd state on dumpsdata1002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_rasdaemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:26:16] RECOVERY - Disk space on gitlab1001 is OK: DISK OK 
https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=gitlab1001&var-datasource=eqiad+prometheus/ops [08:37:30] PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:44:22] PROBLEM - SSH on analytics1061.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:01:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1145.eqiad.wmnet with reason: Maintenance [09:01:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1145.eqiad.wmnet with reason: Maintenance [09:01:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:17] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5003.eqsin.wmnet [09:13:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:56] PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:18:51] .11 [09:18:54] uff :) [09:22:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5003.eqsin.wmnet [09:22:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:16] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:24:59] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti5003.eqsin.wmnet to ganeti01.svc.eqsin.wmnet [09:25:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[09:25:46] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti5003.eqsin.wmnet to ganeti01.svc.eqsin.wmnet [09:25:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:57] !log jelto@cumin1001 START - Cookbook sre.hosts.reimage for host gitlab1003.wikimedia.org with OS bullseye [09:26:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:45] (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:36:40] PROBLEM - SSH on bast3005 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:37:46] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab1003.wikimedia.org with reason: host reimage [09:37:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:31] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1022.eqiad.wmnet [09:38:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:42] RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:38:56] RECOVERY - SSH on bast3005 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:40:35] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab1003.wikimedia.org with reason: host reimage [09:40:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:45] (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:42:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1022.eqiad.wmnet [09:42:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:33] !log volans@cumin1001 START - Cookbook sre.dns.netbox [09:44:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:02] (MXQueueNoMetrics) firing: Queue length metrics not found - https://wikitech.wikimedia.org/wiki/Exim - https://grafana.wikimedia.org/d/000000451/mail - https://alerts.wikimedia.org/?q=alertname%3DMXQueueNoMetrics [09:45:46] RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:49:24] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:49:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:02] (MXQueueNoMetrics) firing: (2) Queue length metrics not found - https://wikitech.wikimedia.org/wiki/Exim - https://grafana.wikimedia.org/d/000000451/mail - https://alerts.wikimedia.org/?q=alertname%3DMXQueueNoMetrics [09:50:45] (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:51:25] hah! 
the mxqueuenometrics makes sense, I'll fix it [09:54:32] !log failover ganeti master in eqsin to ganeti5003 T308211 [09:54:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:38] T308211: Upgrade ganeti/eqsin to Bullseye - https://phabricator.wikimedia.org/T308211 [09:55:02] (MXQueueNoMetrics) firing: (8) Queue length metrics not found - https://wikitech.wikimedia.org/wiki/Exim - https://grafana.wikimedia.org/d/000000451/mail - https://alerts.wikimedia.org/?q=alertname%3DMXQueueNoMetrics [09:55:45] (JobUnavailable) resolved: (4) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:55:48] !log drain ganeti5001 T308211 [09:55:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:36] godog: happy for me to merge your cr [09:56:45] jbond: oops! 
yes please [09:56:50] np doing [09:57:06] * jbond done [09:59:44] RECOVERY - Disk space on gitlab1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=gitlab1003&var-datasource=eqiad+prometheus/ops [10:00:02] (MXQueueNoMetrics) firing: (6) Queue length metrics not found - https://wikitech.wikimedia.org/wiki/Exim - https://grafana.wikimedia.org/d/000000451/mail - https://alerts.wikimedia.org/?q=alertname%3DMXQueueNoMetrics [10:00:04] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab1003.wikimedia.org with OS bullseye [10:00:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:37] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:02:01] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1023.eqiad.wmnet [10:02:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:26] ^ this includes a restart of kubetcd1005 since not on DRBD [10:04:35] PROBLEM - Host kubetcd1005 is DOWN: PING CRITICAL - Packet loss = 100% [10:05:34] PROBLEM - ganeti-wconfd running on ganeti5001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [10:05:45] RECOVERY - Host kubetcd1005 is UP: PING OK - Packet loss = 0%, RTA = 0.60 ms [10:07:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1023.eqiad.wmnet [10:07:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T298555)', diff saved to https://phabricator.wikimedia.org/P28306 and previous config saved to 
/var/cache/conftool/dbconfig/20220523-100809-ladsgroup.json [10:08:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:15] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [10:09:12] !log starting reboot of eqiad maps hosts for updates [10:09:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:16] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps1005.eqiad.wmnet [10:09:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:02] (MXQueueNoMetrics) firing: Queue length metrics not found - https://wikitech.wikimedia.org/wiki/Exim - https://grafana.wikimedia.org/d/000000451/mail - https://alerts.wikimedia.org/?q=alertname%3DMXQueueNoMetrics [10:10:22] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on 6 hosts with reason: postgres config change [10:10:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:27] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on 6 hosts with reason: postgres config change [10:10:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1119.eqiad.wmnet with reason: Maintenance [10:12:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1119.eqiad.wmnet with reason: Maintenance [10:12:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1119 (T303171)', diff saved to https://phabricator.wikimedia.org/P28307 and previous config saved to /var/cache/conftool/dbconfig/20220523-101222-ladsgroup.json 
[10:12:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:28] T303171: Upgrade s1 to Bullseye - https://phabricator.wikimedia.org/T303171 [10:12:58] (KubernetesRsyslogDown) firing: (4) rsyslog on kubernetes1014:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:15:02] (MXQueueNoMetrics) firing: (2) Queue length metrics not found - https://wikitech.wikimedia.org/wiki/Exim - https://grafana.wikimedia.org/d/000000451/mail - https://alerts.wikimedia.org/?q=alertname%3DMXQueueNoMetrics [10:15:43] RECOVERY - Check systemd state on maps1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:17:12] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps1005.eqiad.wmnet [10:17:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:23] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on maps1006.eqiad.wmnet with reason: security update [10:17:24] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on maps1006.eqiad.wmnet with reason: security update [10:17:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:50] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps1006.eqiad.wmnet [10:18:52] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1025.eqiad.wmnet [10:18:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to 
https://phabricator.wikimedia.org/P28308 and previous config saved to /var/cache/conftool/dbconfig/20220523-102314-ladsgroup.json [10:23:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db1119.eqiad.wmnet with OS bullseye [10:24:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:09] !log btullis@deploy1002 Started deploy [analytics/superset/deploy@09094de]: (no justification provided) [10:25:12] !log btullis@deploy1002 Finished deploy [analytics/superset/deploy@09094de]: (no justification provided) (duration: 00m 03s) [10:25:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1119.eqiad.wmnet with reason: host reimage [10:33:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1119.eqiad.wmnet with reason: host reimage [10:35:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:49] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps1006.eqiad.wmnet [10:37:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P28309 and previous config saved to /var/cache/conftool/dbconfig/20220523-103819-ladsgroup.json [10:38:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:12] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on maps1007.eqiad.wmnet with reason: security update [10:40:13] !log hnowlan@cumin1001 END 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on maps1007.eqiad.wmnet with reason: security update [10:40:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:56] RECOVERY - Check systemd state on maps1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:42:48] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:44:40] RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:51:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1119.eqiad.wmnet with OS bullseye [10:51:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T298555)', diff saved to https://phabricator.wikimedia.org/P28310 and previous config saved to /var/cache/conftool/dbconfig/20220523-105324-ladsgroup.json [10:53:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1105.eqiad.wmnet with reason: Maintenance [10:53:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1105.eqiad.wmnet with reason: Maintenance [10:53:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:31] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [10:53:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T298555)', diff saved to https://phabricator.wikimedia.org/P28311 and previous config saved to 
/var/cache/conftool/dbconfig/20220523-105332-ladsgroup.json [10:53:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:41] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host ganeti1025.eqiad.wmnet [10:58:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T303171)', diff saved to https://phabricator.wikimedia.org/P28312 and previous config saved to /var/cache/conftool/dbconfig/20220523-110043-ladsgroup.json [11:00:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:49] T303171: Upgrade s1 to Bullseye - https://phabricator.wikimedia.org/T303171 [11:01:21] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1026.eqiad.wmnet [11:01:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:58] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps1007.eqiad.wmnet [11:11:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:08] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps1008.eqiad.wmnet [11:11:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:17] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on maps1008.eqiad.wmnet with reason: security update [11:11:19] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on maps1008.eqiad.wmnet with reason: security update [11:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[11:12:07] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:15:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P28313 and previous config saved to /var/cache/conftool/dbconfig/20220523-111548-ladsgroup.json [11:15:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:08] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on maps1008.eqiad.wmnet with reason: security update [11:18:09] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on maps1008.eqiad.wmnet with reason: security update [11:18:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1177 to clone db1172', diff saved to https://phabricator.wikimedia.org/P28314 and previous config saved to /var/cache/conftool/dbconfig/20220523-111902-marostegui.json [11:19:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:29] RECOVERY - Check systemd state on maps1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:25:01] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps1008.eqiad.wmnet [11:25:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:13] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on maps1009.eqiad.wmnet with reason: security update [11:25:14] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on maps1009.eqiad.wmnet with reason: security update [11:25:15] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P28316 and previous config saved to /var/cache/conftool/dbconfig/20220523-113053-ladsgroup.json [11:30:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:03] PROBLEM - Maps - OSM synchronization lag - eqiad on alert1001 is CRITICAL: 2.853e+06 ge 2.592e+05 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&viewPanel=11 [11:38:19] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on maps1010.eqiad.wmnet with reason: security update [11:38:20] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on maps1010.eqiad.wmnet with reason: security update [11:38:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:26] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps1010.eqiad.wmnet [11:38:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:10] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host ganeti1026.eqiad.wmnet [11:41:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:35] RECOVERY - Check systemd state on maps1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:45:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T303171)', diff saved to https://phabricator.wikimedia.org/P28317 and previous config saved to /var/cache/conftool/dbconfig/20220523-114559-ladsgroup.json 
[11:46:03] RECOVERY - SSH on analytics1061.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:46:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:05] T303171: Upgrade s1 to Bullseye - https://phabricator.wikimedia.org/T303171 [11:47:13] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host maps2005.codfw.wmnet [11:47:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:51:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1134.eqiad.wmnet with reason: Maintenance [11:51:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1134.eqiad.wmnet with reason: Maintenance [11:52:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T303171)', diff saved to https://phabricator.wikimedia.org/P28318 and previous config saved to /var/cache/conftool/dbconfig/20220523-115202-ladsgroup.json [11:52:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:08] T303171: Upgrade s1 to Bullseye - https://phabricator.wikimedia.org/T303171 [11:52:51] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps2005.codfw.wmnet [11:52:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:56:05] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host maps2006.codfw.wmnet [11:56:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db1134.eqiad.wmnet with OS bullseye [11:56:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:10] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps2006.codfw.wmnet [12:01:10] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1027.eqiad.wmnet [12:01:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:08] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:03:40] PROBLEM - Host kubestagetcd1006 is DOWN: PING CRITICAL - Packet loss = 100% [12:05:34] RECOVERY - Host kubestagetcd1006 is UP: PING OK - Packet loss = 0%, RTA = 0.53 ms [12:06:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1134.eqiad.wmnet with reason: host reimage [12:06:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1134.eqiad.wmnet with reason: host reimage [12:09:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:58] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:16:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1166.eqiad.wmnet with reason: Maintenance [12:16:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1166.eqiad.wmnet with reason: Maintenance [12:16:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T298555)', diff saved to https://phabricator.wikimedia.org/P28320 and previous config saved to /var/cache/conftool/dbconfig/20220523-121659-ladsgroup.json [12:17:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:07] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [12:18:38] RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:20:30] PROBLEM - Check systemd state on ms-fe1009 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:20:33] !log aqu@deploy1002 Started deploy [airflow-dags/analytics_test@c9b397c]: T305843_migrate_clickstream_job_from_oozie_to_airflow [airflow-dags/analytics_test@c9b397c] [12:20:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:42] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics_test@c9b397c]: T305843_migrate_clickstream_job_from_oozie_to_airflow [airflow-dags/analytics_test@c9b397c] (duration: 00m 08s) [12:20:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:48] !log aqu@deploy1002 Started deploy [airflow-dags/analytics@c9b397c]: 
T305843_migrate_clickstream_job_from_oozie_to_airflow [airflow-dags/analytics@c9b397c] [12:23:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:56] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics@c9b397c]: T305843_migrate_clickstream_job_from_oozie_to_airflow [airflow-dags/analytics@c9b397c] (duration: 00m 08s) [12:23:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:29] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:25:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1134.eqiad.wmnet with OS bullseye [12:26:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1027.eqiad.wmnet [12:28:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:11] RECOVERY - Check systemd state on ms-fe1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:31:23] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:39:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T303171)', diff saved to https://phabricator.wikimedia.org/P28321 and previous config saved to /var/cache/conftool/dbconfig/20220523-123944-ladsgroup.json [12:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:49] T303171: Upgrade s1 to Bullseye - https://phabricator.wikimedia.org/T303171 [12:51:04] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on ganeti5001.eqsin.wmnet with reason: Remove from cluster for firmware update and 
eventual reimage [12:51:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on ganeti5001.eqsin.wmnet with reason: Remove from cluster for firmware update and eventual reimage [12:51:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P28322 and previous config saved to /var/cache/conftool/dbconfig/20220523-125449-ladsgroup.json [12:54:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: #bothumor My software never has bugs. It just develops random features. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220523T1300). [13:00:05] James_F, koi, and tgr: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:14] My patch has already been deployed. [13:00:21] o/ [13:00:27] Thanks, Amir1. 
:-) [13:00:34] hi there [13:00:53] I guess I should do the deploys then [13:04:09] seems like wikibugs bot is on vacation [13:07:16] tgr: if you can, would be great :) [13:07:19] wikibugs is T308995 [13:07:19] T308995: wikibugs not show phab/gerrit comments on IRC - https://phabricator.wikimedia.org/T308995 [13:09:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P28323 and previous config saved to /var/cache/conftool/dbconfig/20220523-130954-ladsgroup.json [13:09:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:32] urbanecm: looks like the same kind of mystery issue that we saw in T291129 and T304180 [13:12:33] T304180: Wikibugs: Quit due to excess flood - https://phabricator.wikimedia.org/T304180 [13:12:33] T291129: wikibugs failing to connect when run on exec hosts - https://phabricator.wikimedia.org/T291129 [13:12:40] possible :) [13:13:51] thanks tgr, no need to test this patch [13:15:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:15:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:18] !log tgr@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:794000|Update IP addresses for Wiki Education Dashboard exemptions (T308702)]] (duration: 00m 52s) [13:15:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:22] T308702: Update IP addresses for Wiki Education Dashboard exemptions to rate-limiting and global block - https://phabricator.wikimedia.org/T308702 [13:16:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:16:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:16:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:15] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T298555)', diff saved to https://phabricator.wikimedia.org/P28324 and previous config saved to /var/cache/conftool/dbconfig/20220523-131641-ladsgroup.json [13:16:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:46] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [13:17:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:17:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:50] koi: IP exemption patch is live, zhwiki RC patrol patch is on mwdebug1001 [13:18:00] could a sysadmin have a look at T308976? I could not patrol at zhwiki so couldn't check.. [13:18:02] T308976: Enable Recent Changes Patrol for Chinese Wikipedia - https://phabricator.wikimedia.org/T308976 [13:21:50] ping taavi and urbanecm for help ^ [13:22:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:22:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:16] ? [13:22:43] what would I look for exactly? 
[13:22:54] need to check if every new edit has a "mark for patrol" link [13:23:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:23:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:23:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:24:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T298555)', diff saved to https://phabricator.wikimedia.org/P28325 and previous config saved to /var/cache/conftool/dbconfig/20220523-132438-ladsgroup.json [13:24:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:44] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [13:24:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T303171)', diff saved to https://phabricator.wikimedia.org/P28326 and previous config saved to /var/cache/conftool/dbconfig/20220523-132459-ladsgroup.json [13:25:04] like for this edit, is there a link to mark for patrol at the top (near the timestamp) [13:25:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:05] https://zh.wikipedia.org/w/index.php?title=%E4%B8%89%E9%97%96%E5%B0%91%E6%9E%97&type=revision&diff=71783956&oldid=71783920&diffmode=source&uselang=en [13:25:05] T303171: Upgrade s1 to Bullseye - https://phabricator.wikimedia.org/T303171 [13:25:06] nothing like that jumps out [13:25:30] I can verify via shell.php that $wgUseRCPatrol is true [13:26:06] +sysadmin doesn't include 'patrol' or 'patrolmarks' needed to see those, only 'autopatrol' [13:27:02] 
could you grant yourself the "patroller" right to check it [13:27:24] No, +sysadmins should never grant themselves rights except in emergencies. [13:27:29] staff does have patrolmarks (though not patrol) [13:28:20] well, anyway let's sync; though not a big problem [13:28:29] ok [13:29:04] oh, duh, I had enabled xdebug instead of x-wikimedia-debug [13:29:11] ok, I can see the patrol marks [13:29:52] thanks! [13:30:29] !log tgr@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:795526|zhwiki: Enable RCPatrol (T308976)]] (duration: 00m 51s) [13:30:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:34] T308976: Enable Recent Changes Patrol for Chinese Wikipedia - https://phabricator.wikimedia.org/T308976 [13:31:40] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:31:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P28327 and previous config saved to /var/cache/conftool/dbconfig/20220523-133146-ladsgroup.json [13:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:01] koi: patrol is live, itwiki new protection level is on mwdebug1001 [13:32:08] looking [13:32:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1135.eqiad.wmnet with reason: Maintenance [13:32:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1135.eqiad.wmnet with reason: Maintenance [13:32:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T303171)', diff saved to https://phabricator.wikimedia.org/P28328 and previous config saved to /var/cache/conftool/dbconfig/20220523-133228-ladsgroup.json [13:32:29] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:34] T303171: Upgrade s1 to Bullseye - https://phabricator.wikimedia.org/T303171 [13:33:51] tgr, LGTM [13:34:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:34:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:35:15] !log tgr@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:794590|itwiki: Add "editautopatrolprotected" protection level (T308917)]] (duration: 00m 52s) [13:35:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:35:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:22] T308917: Add "editautopatrolprotected" protection level to itwiki - https://phabricator.wikimedia.org/T308917 [13:35:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:36:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:21] koi: protection level is live, rowiki namespace names are on mwdebug1001 [13:39:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P28329 and previous config saved to /var/cache/conftool/dbconfig/20220523-133944-ladsgroup.json [13:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:51] tgr: LGTM [13:40:06] please also run namespaceDupes.php [13:40:59] !log tgr@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:793999|rowiki: 
Use Romanian canonical name (T127607)]] (duration: 00m 50s) [13:41:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:05] T127607: Fix canonical namespaces for rowiki - https://phabricator.wikimedia.org/T127607 [13:41:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:41:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:11] Is it possible to set configs for specific shards? It should be since there is a .dblist file for each shard, right? [13:42:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:42:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:42:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:03] koi: doesn't find anything to fix [13:43:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:43:21] thanks anyway [13:43:27] since the definitions were just swapped between canonical and alias, I guess that's to be expected [13:43:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:54] zabe: should be, yes [13:44:01] although I'm quite curious on your use case for that [13:44:54] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/797294 [13:45:36] !log tgr@deploy1002 Synchronized php-1.39.0-wmf.12/extensions/OAuth/src/Frontend/SpecialPages/SpecialMWOAuthConsumerRegistration.php: Backport: [[gerrit:793795|Remove 'required' from callbackIsPrefix (T308880)]] (duration: 00m 50s) [13:45:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:42] T308880: "callback is prefix" checkbox should not be required during registration - https://phabricator.wikimedia.org/T308880 [13:46:05] !log EU mid-day deploys done [13:46:06] !log 
ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db1135.eqiad.wmnet with OS bullseye [13:46:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:19] I'll test the last one in production, it's a trivial change [13:46:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P28330 and previous config saved to /var/cache/conftool/dbconfig/20220523-134651-ladsgroup.json [13:46:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1177 (re)pooling @ 1%: After recloning db1172', diff saved to https://phabricator.wikimedia.org/P28331 and previous config saved to /var/cache/conftool/dbconfig/20220523-134657-root.json [13:47:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:24] zabe: hmm, the diffConfig job doesn't look as expected [13:48:11] hmm, yeah [13:49:40] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host maps2007.codfw.wmnet [13:49:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:28] hello, first time here, I'm going to list https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/789613/ in the calendar for today's late window (T307683) [13:52:28] T307683: Add localized wordmark for plwiktionary - https://phabricator.wikimedia.org/T307683 [13:54:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P28332 and previous config saved to /var/cache/conftool/dbconfig/20220523-135449-ladsgroup.json [13:54:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:27] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
maps2007.codfw.wmnet [13:55:28] PeterBowman: Welcome! You should crush the SVG file first, please. [13:55:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1135.eqiad.wmnet with reason: host reimage [13:55:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:22] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1028.eqiad.wmnet [13:56:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:08] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host maps2008.codfw.wmnet [13:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1135.eqiad.wmnet with reason: host reimage [13:59:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1028.eqiad.wmnet [14:00:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T298555)', diff saved to https://phabricator.wikimedia.org/P28334 and previous config saved to /var/cache/conftool/dbconfig/20220523-140156-ladsgroup.json [14:01:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [14:01:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [14:02:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1177 (re)pooling @ 5%: After recloning db1172', diff saved to 
https://phabricator.wikimedia.org/P28335 and previous config saved to /var/cache/conftool/dbconfig/20220523-140201-root.json [14:02:02] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [14:02:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:44] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps2008.codfw.wmnet [14:02:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:31] !log aqu@deploy1002 Started deploy [airflow-dags/analytics_test@95d0f86]: T295072 Spark 3 from Airflow venv pyspark [airflow-dags/analytics_test@95d0f86] [14:08:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:37] T295072: Install spark3 in analytics clusters - https://phabricator.wikimedia.org/T295072 [14:08:40] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics_test@95d0f86]: T295072 Spark 3 from Airflow venv pyspark [airflow-dags/analytics_test@95d0f86] (duration: 00m 08s) [14:08:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T298555)', diff saved to https://phabricator.wikimedia.org/P28336 and previous config saved to /var/cache/conftool/dbconfig/20220523-140954-ladsgroup.json [14:09:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1175.eqiad.wmnet with reason: Maintenance [14:09:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1175.eqiad.wmnet with reason: Maintenance [14:09:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:58] T298555: Fix 
mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [14:10:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T298555)', diff saved to https://phabricator.wikimedia.org/P28337 and previous config saved to /var/cache/conftool/dbconfig/20220523-141001-ladsgroup.json [14:10:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:33] !log aqu@deploy1002 Started deploy [airflow-dags/analytics@95d0f86]: T295072 spark 3 from airflow venv pyspark [airflow-dags/analytics@95d0f86] [14:12:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:42] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics@95d0f86]: T295072 spark 3 from airflow venv pyspark [airflow-dags/analytics@95d0f86] (duration: 00m 08s) [14:12:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:13] (KubernetesRsyslogDown) firing: (4) rsyslog on kubernetes1014:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:14:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1135.eqiad.wmnet with OS bullseye [14:14:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1177 (re)pooling @ 10%: After recloning db1172', diff saved to https://phabricator.wikimedia.org/P28338 and previous config saved to /var/cache/conftool/dbconfig/20220523-141705-root.json [14:17:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:56] !log failover ganeti master in eqiad to ganeti1027 [14:19:01] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:22] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host maps2010.codfw.wmnet [14:20:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:34] !log Add AAAA records to relforge1003 and 1004 T271143 [14:20:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:39] T271143: Some elastic hosts do not have IPv6 DNS records - https://phabricator.wikimedia.org/T271143 [14:22:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T303171)', diff saved to https://phabricator.wikimedia.org/P28339 and previous config saved to /var/cache/conftool/dbconfig/20220523-142202-ladsgroup.json [14:22:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:09] T303171: Upgrade s1 to Bullseye - https://phabricator.wikimedia.org/T303171 [14:23:00] PROBLEM - ganeti-wconfd running on ganeti1024 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [14:26:10] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host maps2010.codfw.wmnet [14:26:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:40] !log bking@cumin1001 START - Cookbook sre.dns.netbox [14:26:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:39] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host maps2009.codfw.wmnet [14:30:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1177 (re)pooling @ 25%: After recloning db1172', diff saved to https://phabricator.wikimedia.org/P28340 and previous config saved to /var/cache/conftool/dbconfig/20220523-143209-root.json [14:32:13] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:39] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:34:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:14] RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:36:42] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host maps2009.codfw.wmnet [14:36:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P28341 and previous config saved to /var/cache/conftool/dbconfig/20220523-143707-ladsgroup.json [14:37:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:52] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1024.eqiad.wmnet [14:39:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:34] PROBLEM - Host kubetcd1004 is DOWN: PING CRITICAL - Packet loss = 100% [14:45:50] RECOVERY - Host kubetcd1004 is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms [14:46:43] <_joe_> wat [14:46:55] <_joe_> ahh ganeti down [14:46:56] <_joe_> ok [14:47:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1177 (re)pooling @ 50%: After recloning db1172', diff saved to https://phabricator.wikimedia.org/P28342 and previous config saved to /var/cache/conftool/dbconfig/20220523-144713-root.json [14:47:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:44] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ganeti1024.eqiad.wmnet [14:49:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:22] PROBLEM - ganeti-wconfd running on ganeti1024 is CRITICAL: PROCS 
CRITICAL: 0 processes with UID = 113 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [14:52:11] ^ that'll recover soon, monitoring artefact of the master failover [14:52:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P28343 and previous config saved to /var/cache/conftool/dbconfig/20220523-145212-ladsgroup.json [14:52:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:58] (KubernetesRsyslogDown) firing: (5) rsyslog on kubernetes1014:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:01:51] !log jbond@cumin1001 START - Cookbook sre.ganeti.makevm for new host netbox1002.eqiad.wmnet [15:01:52] !log jbond@cumin1001 START - Cookbook sre.dns.netbox [15:01:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:10] !log rebooting ms-be2069 to look at disk config [15:02:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1177 (re)pooling @ 75%: After recloning db1172', diff saved to https://phabricator.wikimedia.org/P28345 and previous config saved to /var/cache/conftool/dbconfig/20220523-150217-root.json [15:02:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:40] PROBLEM - Host ms-be2069 is DOWN: PING CRITICAL - Packet loss = 100% [15:06:17] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:06:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T303171)', diff saved to 
https://phabricator.wikimedia.org/P28346 and previous config saved to /var/cache/conftool/dbconfig/20220523-150717-ladsgroup.json [15:07:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:21] T303171: Upgrade s1 to Bullseye - https://phabricator.wikimedia.org/T303171 [15:07:43] Emperor: fyi if you use `sudo cookbook sre.hosts.reboot-single $hostname` it will take care of downtiming the host in icinga [15:08:20] oh, duh, yes, sorry. [15:08:26] no problem :) [15:11:32] RECOVERY - Host ms-be2069 is UP: PING OK - Packet loss = 0%, RTA = 33.14 ms [15:12:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1184.eqiad.wmnet with reason: Maintenance [15:12:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1184.eqiad.wmnet with reason: Maintenance [15:12:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T303171)', diff saved to https://phabricator.wikimedia.org/P28348 and previous config saved to /var/cache/conftool/dbconfig/20220523-151207-ladsgroup.json [15:12:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:36] jouncebot: nowandnext [15:12:36] No deployments scheduled for the next 0 hour(s) and 17 minute(s) [15:12:36] In 0 hour(s) and 17 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220523T1530) [15:13:06] !log poweroff cp2038 for maintenance [15:13:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:05] !log jbond@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netbox1002.eqiad.wmnet [15:14:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:18] !log 
ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db1184.eqiad.wmnet with OS bullseye [15:17:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1177 (re)pooling @ 100%: After recloning db1172', diff saved to https://phabricator.wikimedia.org/P28349 and previous config saved to /var/cache/conftool/dbconfig/20220523-151721-root.json [15:17:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:33] jan_drewniak: please wait with your portals deployment until further notice, there is an urgent security issue me and taavi would like to fix before that. [15:18:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T298555)', diff saved to https://phabricator.wikimedia.org/P28350 and previous config saved to /var/cache/conftool/dbconfig/20220523-151826-ladsgroup.json [15:18:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:32] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [15:24:45] PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:25:37] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [15:26:05] !log deploy patch for T309028 [15:26:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:53] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2009.codfw.wmnet [15:27:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:43] !log ladsgroup@cumin1001 START - Cookbook 
sre.hosts.downtime for 2:00:00 on db1184.eqiad.wmnet with reason: host reimage [15:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:05] jan_drewniak: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220523T1530). [15:30:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [15:30:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [15:30:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1184.eqiad.wmnet with reason: host reimage [15:32:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2009.codfw.wmnet [15:32:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:31] jan_drewniak: we're done, you can proceed as usual [15:32:56] !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1020.eqiad.wmnet with OS bullseye [15:33:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P28351 and previous config saved to /var/cache/conftool/dbconfig/20220523-153331-ladsgroup.json [15:33:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:06] !log volans@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS 
buster [15:34:06] !log volans@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host sretest1001.eqiad.wmnet with OS buster [15:34:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:15] !log volans@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS buster [15:35:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:35:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [15:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:36:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:27] hey, we need to do a deployment for the Wikifeeds service out of the deployment window for the apps fundraising campaign, are there any questions or concerns about doing that nowish? 
[15:38:47] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.remove-downtime for cp2038.codfw.wmnet [15:38:48] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp2038.codfw.wmnet [15:38:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:13] !log pool cp2038 - T308459 [15:39:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:18] T308459: codfw: cp2038 Correctable memory error on DIMM A3 - https://phabricator.wikimedia.org/T308459 [15:42:36] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2010.codfw.wmnet [15:42:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:42] !log mbsantos@deploy1002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [15:42:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:04] !log mbsantos@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [15:43:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:42] !log mbsantos@deploy1002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [15:43:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:45] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti5001.eqsin.wmnet with OS bullseye [15:43:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:28] !log mbsantos@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply [15:44:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:57] !log mbsantos@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [15:45:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:42] !log mbsantos@deploy1002 helmfile [eqiad] DONE 
helmfile.d/services/wikifeeds: apply [15:45:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:02] !log volans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage [15:46:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2010.codfw.wmnet [15:46:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:56] (PS3) Jbond: netbox: add new netbox serveres to netbox::fronend [puppet] - https://gerrit.wikimedia.org/r/797329 [15:47:40] SRE, Wikibugs: wikibugs has stopped showing phab/gerrit comments on IRC as of 2022-05-22Z17:00 - https://phabricator.wikimedia.org/T308995 (valhallasw) @Marostegui which command(s) did you run, exactly? ` tools.wikibugs@tools-sgebastion-10:~$ kubectl get pods NAME READY ST... [15:48:10] SRE-OnFire, SRE Observability (FY2021/2022-Q4): Set up POC Dispatch environment and evaluate its viability - https://phabricator.wikimedia.org/T309033 (herron) After spending some time deploying and testing dispatch in a POC lab environment (dispatch[12].sre-sandbox.eqiad1.wikimedia.cloud), here are my r... 
[15:48:14] SRE, Wikibugs: wikibugs has stopped showing phab/gerrit comments on IRC as of 2022-05-22Z17:00 - https://phabricator.wikimedia.org/T308995 (DannyS712) Wikibugs just joined `#wikimedia-operations` [15:48:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P28352 and previous config saved to /var/cache/conftool/dbconfig/20220523-154836-ladsgroup.json [15:48:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:50] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage [15:48:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1184.eqiad.wmnet with OS bullseye [15:49:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:11] (CR) CI reject: [V: -1] netbox: add new netbox serveres to netbox::fronend [puppet] - https://gerrit.wikimedia.org/r/797329 (owner: Jbond) [15:50:15] SRE, Wikibugs: wikibugs has stopped showing phab/gerrit comments on IRC as of 2022-05-22Z17:00 - https://phabricator.wikimedia.org/T308995 (Marostegui) Thanks @valhallasw! I followed what wikitech mentions. 
Maybe we should write it clearer so we don't have to bother you again :-) [15:51:13] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [15:53:18] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2011.codfw.wmnet [15:53:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:25] PROBLEM - Host kubestagetcd2002 is DOWN: PING CRITICAL - Packet loss = 100% [15:57:31] SRE, Infrastructure-Foundations, netops: Upgrade Fastnetmon to 1.2.1 - https://phabricator.wikimedia.org/T271228 (MoritzMuehlenhoff) @ayounsi I've built a backport of fastnetmon 1.2.1 for bullseye-wikimedia. It's not yet uploaded to apt.wikimedia.org, let's sync up for some smoke testing when you're... [15:57:36] ^ kubestagetcd2002 is the ganeti reboot [15:57:59] (PS2) Muehlenhoff: thanos: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/793775 (https://phabricator.wikimedia.org/T308013) [15:58:51] (PS4) Jbond: netbox: add new netbox serveres to netbox::fronend [puppet] - https://gerrit.wikimedia.org/r/797329 [15:59:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2011.codfw.wmnet [15:59:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:08] (CR) Muehlenhoff: [C: +2] thanos: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/793775 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff) [16:00:49] RECOVERY - Host kubestagetcd2002 is UP: PING OK - Packet loss = 0%, RTA = 33.46 ms [16:01:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T303171)', diff saved to https://phabricator.wikimedia.org/P28353 and previous config saved to 
/var/cache/conftool/dbconfig/20220523-160105-ladsgroup.json [16:01:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:12] T303171: Upgrade s1 to Bullseye - https://phabricator.wikimedia.org/T303171 [16:01:21] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1001.eqiad.wmnet with OS buster [16:01:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1102.eqiad.wmnet with reason: Maintenance [16:01:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1102.eqiad.wmnet with reason: Maintenance [16:01:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:45] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host aqs1020.eqiad.wmnet with OS bullseye [16:01:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:49] SRE, ops-eqiad, Cassandra, DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host aqs1020.eqiad.wmnet with OS bullseye executed with errors: - aqs1020... 
[16:01:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:23] (PS2) Muehlenhoff: klaxon: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/793773 (https://phabricator.wikimedia.org/T308013) [16:03:40] (PS2) Muehlenhoff: helm/helmfile: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/793774 (https://phabricator.wikimedia.org/T308013) [16:03:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T298555)', diff saved to https://phabricator.wikimedia.org/P28354 and previous config saved to /var/cache/conftool/dbconfig/20220523-160341-ladsgroup.json [16:03:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1102.eqiad.wmnet with reason: Maintenance [16:03:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1102.eqiad.wmnet with reason: Maintenance [16:03:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:47] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [16:03:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:22] (CR) Jbond: [C: +2] netbox: add new netbox serveres to netbox::fronend [puppet] - https://gerrit.wikimedia.org/r/797329 (owner: Jbond) [16:06:57] (CR) Muehlenhoff: [C: +2] klaxon: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/793773 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff) [16:06:58] (KubernetesRsyslogDown) firing: (3) rsyslog on kubestage1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:10:56] (PS1) Jbond: P:netbox::automation: Drop Acme dependency [puppet] - 
https://gerrit.wikimedia.org/r/797338 (https://phabricator.wikimedia.org/T296452) [16:11:27] (PS1) Zabe: toolforge: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/797339 (https://phabricator.wikimedia.org/T308013) [16:13:07] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [16:13:51] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti5001.eqsin.wmnet with reason: host reimage [16:13:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:03] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [16:15:25] (PS1) Zabe: tomcat: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/797341 (https://phabricator.wikimedia.org/T308013) [16:16:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P28355 and previous config saved to /var/cache/conftool/dbconfig/20220523-161610-ladsgroup.json [16:16:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:38] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2012.codfw.wmnet [16:16:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:00] (PS3) Zabe: postgresql: remove absented backup crons [puppet] - https://gerrit.wikimedia.org/r/777434 (https://phabricator.wikimedia.org/T273673) [16:17:07] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti5001.eqsin.wmnet with reason: host reimage [16:17:11] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:33] (CR) STran: [C: +1] Deploy IPInfo to all wikis by default [mediawiki-config] - https://gerrit.wikimedia.org/r/793841 (https://phabricator.wikimedia.org/T260597) (owner: Tchanders) [16:19:01] (CR) STran: [C: +1] Add SimilarEditors extension – I: Add to i18n [mediawiki-config] - https://gerrit.wikimedia.org/r/789250 (https://phabricator.wikimedia.org/T306909) (owner: Tchanders) [16:19:01] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [16:19:45] (CR) Jbond: [C: +2] P:netbox::automation: Drop Acme dependency [puppet] - https://gerrit.wikimedia.org/r/797338 (https://phabricator.wikimedia.org/T296452) (owner: Jbond) [16:20:54] does anyone know where I could find documentation for $wgDontNotUnDisenableInstantCommons ? [16:21:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2012.codfw.wmnet [16:21:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:08] DannyS712: I don't think that's a thing - https://codesearch.wmcloud.org/search/?q=DontNotUnDisenableInstantCommons&i=nope&files=&excludeFiles=&repos= [16:22:42] https://bash.toolforge.org/quip/AU7VVSDt6snAnmqnK_wG suggests it is :) [16:23:40] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [16:23:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:11] RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:28:19] !log adding AAAA records for cloudelastic100[1-6] T271143 [16:28:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:23] T271143: Some elastic hosts do not have IPv6 DNS records - 
https://phabricator.wikimedia.org/T271143 [16:29:10] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [16:29:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:29] (CR) Ottomata: [C: +1] "<3" [puppet] - https://gerrit.wikimedia.org/r/797127 (https://phabricator.wikimedia.org/T296982) (owner: Elukey) [16:31:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P28356 and previous config saved to /var/cache/conftool/dbconfig/20220523-163116-ladsgroup.json [16:31:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:28] ^^ Ignore my earlier log msg, cloudelastic already has AAAA records [16:38:54] (PS1) Volans: sre.dns.netbox: limit matching hosts [cookbooks] - https://gerrit.wikimedia.org/r/797343 (https://phabricator.wikimedia.org/T296452) [16:39:09] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti5001.eqsin.wmnet with OS bullseye [16:39:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:15] SRE, Infrastructure-Foundations: Upgrade ganeti/eqsin to Bullseye - https://phabricator.wikimedia.org/T308211 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host ganeti5001.eqsin.wmnet with OS bullseye completed: - ganeti5001 (**PASS**) - Downtimed on Icinga/Ale... [16:39:39] (CR) Jbond: [C: +1] "lgtm thanks" [cookbooks] - https://gerrit.wikimedia.org/r/797343 (https://phabricator.wikimedia.org/T296452) (owner: Volans) [16:40:51] SRE, Infrastructure-Foundations: Upgrade ganeti/eqsin to Bullseye - https://phabricator.wikimedia.org/T308211 (RobH) a:RobH→MoritzMuehlenhoff @MoritzMuehlenhoff, per our earlier IRC discussion, ganeti5001 has had all the firmware updated and reimaged successfully. All yours! 
[16:41:31] SRE, Infrastructure-Foundations: Upgrade ganeti/eqsin to Bullseye - https://phabricator.wikimedia.org/T308211 (RobH) [16:44:16] !log add AAAA records to elastic202[5-9] T271143 [16:44:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:22] T271143: Some elastic hosts do not have IPv6 DNS records - https://phabricator.wikimedia.org/T271143 [16:46:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T303171)', diff saved to https://phabricator.wikimedia.org/P28357 and previous config saved to /var/cache/conftool/dbconfig/20220523-164621-ladsgroup.json [16:46:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:27] T303171: Upgrade s1 to Bullseye - https://phabricator.wikimedia.org/T303171 [16:47:58] (PS1) Ladsgroup: db1106: Disable notification [puppet] - https://gerrit.wikimedia.org/r/797344 (https://phabricator.wikimedia.org/T303171) [16:48:30] (PS1) Ladsgroup: Revert "db1184: Disable notification" [puppet] - https://gerrit.wikimedia.org/r/797217 [16:48:49] (PS2) Ladsgroup: Revert "db1184: Disable notification" [puppet] - https://gerrit.wikimedia.org/r/797217 [16:49:30] (CR) Ladsgroup: [C: +2] db1106: Disable notification [puppet] - https://gerrit.wikimedia.org/r/797344 (https://phabricator.wikimedia.org/T303171) (owner: Ladsgroup) [16:49:52] (PS3) Ladsgroup: Revert "db1184: Disable notification" [puppet] - https://gerrit.wikimedia.org/r/797217 [16:49:56] (CR) Ladsgroup: [V: +2 C: +2] Revert "db1184: Disable notification" [puppet] - https://gerrit.wikimedia.org/r/797217 (owner: Ladsgroup) [16:50:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1106.eqiad.wmnet with reason: Maintenance [16:50:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1106.eqiad.wmnet with reason: Maintenance [16:50:37] !log 
ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [16:50:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [16:50:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T303171)', diff saved to https://phabricator.wikimedia.org/P28358 and previous config saved to /var/cache/conftool/dbconfig/20220523-165045-ladsgroup.json [16:50:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:36] (CR) Volans: [V: +2 C: +2] "zuul stuck with the queue, trivial urgent change to unblock work." 
[cookbooks] - https://gerrit.wikimedia.org/r/797343 (https://phabricator.wikimedia.org/T296452) (owner: Volans) [16:52:52] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [16:52:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:46] SRE, ops-codfw, DC-Ops, fundraising-tech-ops: Q3:(Need By: TBD) rack/setup/install frbackup2002 - https://phabricator.wikimedia.org/T306842 (Jgreen) [16:59:23] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:59:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db1106.eqiad.wmnet with OS bullseye [16:59:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:04] ryankemper: My dear minions, it's time we take the moon! Just kidding. Time for Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220523T1700). 
[17:00:51] !log bking@cumin1001 START - Cookbook sre.dns.netbox [17:00:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:47] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/797341 (https://phabricator.wikimedia.org/T308013) (owner: Zabe) [17:04:06] (Restored) Winston Sung: Rearrange zh-related fallbacks and zh/zh-* translations, aliases in mediawiki/core [core] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/788416 (owner: Winston Sung) [17:04:13] (CR) Jbond: [C: +2] tomcat: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/797341 (https://phabricator.wikimedia.org/T308013) (owner: Zabe) [17:04:14] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:04:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:33] (PS3) Winston Sung: [Abandoned] [core] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/788416 [17:04:39] (Abandoned) Winston Sung: [Abandoned] [core] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/788416 (owner: Winston Sung) [17:04:58] (CR) Jbond: [C: +2] postgresql: remove absented backup crons [puppet] - https://gerrit.wikimedia.org/r/777434 (https://phabricator.wikimedia.org/T273673) (owner: Zabe) [17:06:18] (Restored) Winston Sung: Temporarily disable yue language fallback tests [extensions/Wikibase] (wmf/1.39.0-wmf.9) - https://gerrit.wikimedia.org/r/788608 (owner: Winston Sung) [17:06:23] (PS3) Winston Sung: [Abandoned] [extensions/Wikibase] (wmf/1.39.0-wmf.9) - https://gerrit.wikimedia.org/r/788608 [17:06:28] (Abandoned) Winston Sung: [Abandoned] [extensions/Wikibase] (wmf/1.39.0-wmf.9) - https://gerrit.wikimedia.org/r/788608 (owner: Winston Sung) [17:06:35] (Restored) Winston Sung: Revert "Temporarily disable yue language fallback tests" [extensions/Wikibase] (wmf/1.39.0-wmf.9) - https://gerrit.wikimedia.org/r/788610 (owner: Winston 
Sung) [17:06:47] (PS4) Winston Sung: [Abandoned] [extensions/Wikibase] (wmf/1.39.0-wmf.9) - https://gerrit.wikimedia.org/r/788610 [17:06:51] (Abandoned) Winston Sung: [Abandoned] [extensions/Wikibase] (wmf/1.39.0-wmf.9) - https://gerrit.wikimedia.org/r/788610 (owner: Winston Sung) [17:07:41] (PS2) Jbond: P:ssh::client: Add GSSAPIDelegateCredentials support to ssh::client [puppet] - https://gerrit.wikimedia.org/r/791567 [17:08:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1106.eqiad.wmnet with reason: host reimage [17:08:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:15] (Restored) Winston Sung: Temporarily disable yue language fallback tests [extensions/Wikibase] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/788417 (owner: Winston Sung) [17:09:26] (PS3) Winston Sung: [Abandoned] [extensions/Wikibase] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/788417 [17:09:36] (Abandoned) Winston Sung: [Abandoned] [extensions/Wikibase] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/788417 (owner: Winston Sung) [17:09:45] (Restored) Winston Sung: Revert "Temporarily disable yue language fallback tests" [extensions/Wikibase] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/788418 (owner: Winston Sung) [17:09:48] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service,rq-netbox.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:09:58] (PS4) Winston Sung: [Abandoned] [extensions/Wikibase] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/788418 [17:10:02] (Abandoned) Winston Sung: [Abandoned] [extensions/Wikibase] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/788418 (owner: Winston Sung) [17:10:27] (Restored) Winston Sung: Rearrange zh-related fallbacks and zh/zh-* translations, aliases in mediawiki/core 
[core] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/788606 (owner: 10Winston Sung) [17:10:48] (03PS5) 10Winston Sung: [Abandoned] [core] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/788606 [17:10:53] (03Abandoned) 10Winston Sung: [Abandoned] [core] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/788606 (owner: 10Winston Sung) [17:11:03] (03Restored) 10Winston Sung: Add tests closer to real use cases for Special:MyLanguage [core] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/788611 (https://phabricator.wikimedia.org/T278639) (owner: 10Winston Sung) [17:11:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1106.eqiad.wmnet with reason: host reimage [17:11:12] (03PS3) 10Winston Sung: [Abandoned] [core] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/788611 [17:11:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:17] (03Abandoned) 10Winston Sung: [Abandoned] [core] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/788611 (owner: 10Winston Sung) [17:13:24] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10valhallasw) [17:14:18] (03PS1) 10Papaul: ADd DNS for new frbackuup node [dns] - 10https://gerrit.wikimedia.org/r/797347 [17:19:03] (03CR) 10Dzahn: [C: 03+2] gitlab: reduce backup_keep_time to 2d [puppet] - 10https://gerrit.wikimedia.org/r/797278 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [17:19:56] (03CR) 10Papaul: [C: 03+2] ADd DNS for new frbackuup node [dns] - 10https://gerrit.wikimedia.org/r/797347 (owner: 10Papaul) [17:23:59] 10SRE, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install frbackup2002 - https://phabricator.wikimedia.org/T306842 (10Papaul) [17:26:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1106.eqiad.wmnet with OS bullseye 
[17:26:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:18] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [17:30:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:24] (PS5) AOkoth: vrts: rename module files and classes [puppet] - https://gerrit.wikimedia.org/r/776237 (https://phabricator.wikimedia.org/T293942) [17:34:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T303171)', diff saved to https://phabricator.wikimedia.org/P28359 and previous config saved to /var/cache/conftool/dbconfig/20220523-173439-ladsgroup.json [17:34:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:45] T303171: Upgrade s1 to Bullseye - https://phabricator.wikimedia.org/T303171 [17:36:54] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:36:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:03] (PS1) Samtar: changeprop: Remove WP:ANI from page blacklist [deployment-charts] - https://gerrit.wikimedia.org/r/797354 (https://phabricator.wikimedia.org/T274359) [17:47:29] (PS1) Zabe: toil: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/797355 (https://phabricator.wikimedia.org/T308013) [17:49:07] SRE, ops-codfw, DC-Ops, fundraising-tech-ops: Q3:(Need By: TBD) rack/setup/install frbackup2002 - https://phabricator.wikimedia.org/T306842 (Papaul) switch configuration ` papaul@fasw-c-codfw# run show interfaces ge-[0-1]/0/11 descriptions Interface Admin Link Description ge-0/0/11 u...
[17:49:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P28360 and previous config saved to /var/cache/conftool/dbconfig/20220523-174944-ladsgroup.json [17:49:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:19] (03PS1) 10Zabe: tmpreaper: Add SPDX header [puppet] - 10https://gerrit.wikimedia.org/r/797362 (https://phabricator.wikimedia.org/T308013) [17:53:39] (03CR) 10Samtar: "Again, my uninformed test plan is T274359#7751644, but in this case *just* testing https://en.wikipedia.org/wiki/Wikipedia:Administrators%" [deployment-charts] - 10https://gerrit.wikimedia.org/r/797354 (https://phabricator.wikimedia.org/T274359) (owner: 10Samtar) [17:53:54] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10Jdlrobson) [17:54:56] 10SRE, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: Q3:(Need By: TBD) rack/setup/install frbackup2002 - https://phabricator.wikimedia.org/T306842 (10Papaul) [17:56:03] 10SRE, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: Q3:(Need By: TBD) rack/setup/install frbackup2002 - https://phabricator.wikimedia.org/T306842 (10Papaul) a:05Papaul→03Jgreen @Jgreen all yours [17:56:12] (03PS1) 10Zabe: threedtopng: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/797366 (https://phabricator.wikimedia.org/T308013) [18:00:27] (03CR) 10STran: [C: 03+1] Add comment to consult Legal before updating IPInfo access [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793849 (https://phabricator.wikimedia.org/T308876) (owner: 10Tchanders) [18:01:40] (03CR) 10STran: [C: 03+1] Remove outdated comment about IPInfo from CommonSettings-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793848 (https://phabricator.wikimedia.org/T308876) (owner: 10Tchanders) [18:03:45] (03CR) 10Dzahn: "the team owning the license for these is the Anti Harassment 
team fwiw" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793849 (https://phabricator.wikimedia.org/T308876) (owner: 10Tchanders) [18:04:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P28361 and previous config saved to /var/cache/conftool/dbconfig/20220523-180449-ladsgroup.json [18:04:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:56] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@d1f4367]: T307983: weekly import of image suggestions [18:06:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:01] T307983: Write search index data for image suggestions into a hive table rather than local hdfs files - https://phabricator.wikimedia.org/T307983 [18:07:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1139.eqiad.wmnet with reason: Maintenance [18:07:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1139.eqiad.wmnet with reason: Maintenance [18:07:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:17] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@d1f4367]: T307983: weekly import of image suggestions (duration: 02m 21s) [18:08:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:21] (03PS1) 10Ladsgroup: Revert "db1106: Disable notification" [puppet] - 10https://gerrit.wikimedia.org/r/797218 [18:11:26] (03PS2) 10Ladsgroup: Revert "db1106: Disable notification" [puppet] - 10https://gerrit.wikimedia.org/r/797218 [18:11:45] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Revert "db1106: Disable notification" [puppet] - 10https://gerrit.wikimedia.org/r/797218 (owner: 10Ladsgroup) [18:19:54] !log ladsgroup@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance db1106 (T303171)', diff saved to https://phabricator.wikimedia.org/P28364 and previous config saved to /var/cache/conftool/dbconfig/20220523-181954-ladsgroup.json [18:19:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:00] T303171: Upgrade s1 to Bullseye - https://phabricator.wikimedia.org/T303171 [18:25:16] !log T308647 Bringing `elastic2054` back into service: `ryankemper@elastic2054:~$ sudo pool` (it's not currently banned from cluster so nothing to do there) [18:25:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:22] T308647: elastic2054 is having H/W issues - https://phabricator.wikimedia.org/T308647 [18:25:59] 10SRE, 10ops-codfw, 10CirrusSearch, 10DC-Ops, 10Discovery-Search: elastic2054 is having H/W issues - https://phabricator.wikimedia.org/T308647 (10RKemper) Thanks for looking into this, all. I've brought the host back into service and will reopen the ticket if problems re-surface, but for now things look... [18:31:27] 10SRE, 10SRE-Access-Requests: Requesting access to mwmaint1002.eqiad.wmnet for sgimeno - https://phabricator.wikimedia.org/T309045 (10Sgs) [18:32:32] !log mforns@deploy1002 Started deploy [airflow-dags/analytics@2d8e8d1]: (no justification provided) [18:32:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:40] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@2d8e8d1]: (no justification provided) (duration: 00m 07s) [18:32:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:44] 10SRE, 10SRE-Access-Requests: Requesting access to mwmaint1002.eqiad.wmnet for sgimeno - https://phabricator.wikimedia.org/T309045 (10MShilova_WMF) I confirm that @sgs needs access to a production server and it is currently blocking {https://phabricator.wikimedia.org/T307454}. More context for that task can be... 
[18:39:14] Hey all, I have a feeling https://tools-prometheus.wmflabs.org/tools/api/v1/query_range isn't meant to be returning a 503 (was trying to figure out why https://grafana-labs.wikimedia.org/d/toolforge-k8s-namespace-resources/kubernetes-namespace-resources?orgId=1&refresh=5m&var-namespace=tool-refill wasn't loading) [18:39:39] of course it starts working now [18:42:30] 10SRE-tools, 10Discovery, 10Discovery-Search, 10Infrastructure-Foundations, 10IPv6: Some elastic hosts do not have IPv6 DNS records - https://phabricator.wikimedia.org/T271143 (10bking) AAAA records successfully added for elastic202[5-9]: ` for n in $(cat codfw.hosts); do quad=$(dig aaaa +short ${n});pri... [18:48:38] 10SRE-tools, 10Discovery, 10Infrastructure-Foundations, 10Discovery-Search (Current work), 10IPv6: Some elastic hosts do not have IPv6 DNS records - https://phabricator.wikimedia.org/T271143 (10Gehel) [18:53:13] (KubernetesRsyslogDown) firing: (5) rsyslog on kubernetes1014:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:57:09] (03CR) 10Tchanders: Add comment to consult Legal before updating IPInfo access (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793849 (https://phabricator.wikimedia.org/T308876) (owner: 10Tchanders) [19:02:41] PROBLEM - SSH on wtp1046.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:05:11] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:05:37] TheresNoTime: fyi, issues like the one you raised are best noted in -cloud :) [19:09:39] (03PS1) 10Majavah: nrpe: manage sudo rules via nrpe::check [puppet] - 10https://gerrit.wikimedia.org/r/797422 [19:11:48] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS 
(DIFF 3 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35505/console" [puppet] - 10https://gerrit.wikimedia.org/r/797422 (owner: 10Majavah) [19:12:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:12:30] (03PS1) 10Kosta Harlan: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/797423 (https://phabricator.wikimedia.org/T303515) [19:12:44] (03PS1) 10Clare Ming: Deploy TOC A/B test to frwiki, ptwiki at 50% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/797424 (https://phabricator.wikimedia.org/T306607) [19:12:52] (03CR) 10Kosta Harlan: [C: 03+2] linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/797423 (https://phabricator.wikimedia.org/T303515) (owner: 10Kosta Harlan) [19:14:36] (03CR) 10Dzahn: "forgive my ignorance but wouldn't it be much easier to have the same base class or " include ::nrpe" as every machine in prod instead of i" [puppet] - 10https://gerrit.wikimedia.org/r/797422 (owner: 10Majavah) [19:15:59] (03PS2) 10Clare Ming: Deploy TOC A/B test to frwiki, ptwiki at 50% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/797424 (https://phabricator.wikimedia.org/T306607) [19:16:54] (03CR) 10Eevans: [C: 03+1] aqs: allow Kubernetes nodes access to cassandra [puppet] - 10https://gerrit.wikimedia.org/r/793839 (https://phabricator.wikimedia.org/T304891) (owner: 10Hnowlan) [19:17:09] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@5a4803a]: T307983: zero-pad dates within @dailysnapshot [19:17:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:16] T307983: Write search index data for image suggestions into a hive table 
rather than local hdfs files - https://phabricator.wikimedia.org/T307983 [19:17:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:18:03] (03Merged) 10jenkins-bot: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/797423 (https://phabricator.wikimedia.org/T303515) (owner: 10Kosta Harlan) [19:19:30] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@5a4803a]: T307983: zero-pad dates within @dailysnapshot (duration: 02m 20s) [19:19:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:51] !log kharlan@deploy1002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply [19:19:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:17] !log kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply [19:21:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:36] !log kharlan@deploy1002 helmfile [eqiad] START helmfile.d/services/linkrecommendation: apply [19:23:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:03] !log kharlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: apply [19:26:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:58] !log kharlan@deploy1002 helmfile [codfw] START helmfile.d/services/linkrecommendation: apply [19:27:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:57] PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:29:08] !log 
kharlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/linkrecommendation: apply [19:29:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:36] (CR) Majavah: [V: +1] nrpe: manage sudo rules via nrpe::check (1 comment) [puppet] - https://gerrit.wikimedia.org/r/797422 (owner: Majavah) [19:40:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db2105.codfw.wmnet with reason: Maintenance [19:40:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2105.codfw.wmnet with reason: Maintenance [19:40:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 20:00:00 on 6 hosts with reason: Maintenance [19:40:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on 6 hosts with reason: Maintenance [19:40:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:13] SRE, SRE-Access-Requests: Requesting access to mwmaint1002.eqiad.wmnet for sgimeno - https://phabricator.wikimedia.org/T309045 (SCherukuwada) Manager is OOO. Skip-level Manager here, approved (if needed).
[19:46:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1113.eqiad.wmnet with reason: Maintenance [19:46:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1113.eqiad.wmnet with reason: Maintenance [19:46:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T298560)', diff saved to https://phabricator.wikimedia.org/P28366 and previous config saved to /var/cache/conftool/dbconfig/20220523-194659-ladsgroup.json [19:47:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:06] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [20:00:05] RoanKattouw, Urbanecm, and cjming: May I have your attention please! UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220523T2000) [20:00:06] James_F, DannyS712, koi, PeterBowman, and zabe: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:29] hey o/ [20:00:33] hi [20:00:38] hello [20:01:58] hi - I can deploy [20:02:23] James_F: are you around? [20:03:02] DannyS712: are you around? 
[20:03:49] RECOVERY - SSH on wtp1046.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:04:21] (PS2) Clare Ming: commonswiki: Enable wgCopyUploadAllowOnWikiDomainConfig [mediawiki-config] - https://gerrit.wikimedia.org/r/793766 (https://phabricator.wikimedia.org/T300407) (owner: Stang) [20:05:24] koi: I'll start with your patch since the folks ahead of you haven't responded yet [20:05:31] ok [20:06:15] (CR) Clare Ming: [C: +2] commonswiki: Enable wgCopyUploadAllowOnWikiDomainConfig [mediawiki-config] - https://gerrit.wikimedia.org/r/793766 (https://phabricator.wikimedia.org/T300407) (owner: Stang) [20:06:22] cjming: Sorry, yes, around. [20:06:23] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:06:35] I was expecting to deploy myself, but you can go ahead if you wish. :-) [20:07:04] James_F: sorry about that! i'll let you self-serve after i get this first patch off - i'll ping you when i'm done [20:07:12] Sure! [20:07:13] (KubernetesRsyslogDown) firing: (3) rsyslog on kubestage1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:07:33] (Merged) jenkins-bot: commonswiki: Enable wgCopyUploadAllowOnWikiDomainConfig [mediawiki-config] - https://gerrit.wikimedia.org/r/793766 (https://phabricator.wikimedia.org/T300407) (owner: Stang) [20:08:31] koi: can your patch be verified on mwdebug1001? [20:08:43] I think so, looking [20:08:44] 1 sec - forgot to rebase [20:09:00] koi: ok you can check now [20:09:56] cjming: You should deploy everyone else's changes before mine, so I don't hold anyone else up.
[20:10:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:10:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db2104.codfw.wmnet with reason: Maintenance [20:10:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2104.codfw.wmnet with reason: Maintenance [20:10:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 20:00:00 on 8 hosts with reason: Maintenance [20:10:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on 8 hosts with reason: Maintenance [20:10:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:28] James_F: alrighty - should be done here quick [20:10:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:41] SRE, Discovery-Search (Current work), Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (bking) [20:10:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:10:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:11:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:19] SRE, ops-codfw, DC-Ops, fundraising-tech-ops: Q3:(Need By: TBD) rack/setup/install frbackup2002 - https://phabricator.wikimedia.org/T306842 (Jgreen) [20:11:22] koi: gtg?
[20:11:30] still testing [20:11:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:11:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:06] cjming, LGTM [20:12:13] great - syncing [20:13:11] PeterBowman: can you rebase your patch? i tried thru gerrit but it seems to need a manual rebase [20:13:16] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:793766|commonswiki: Enable wgCopyUploadAllowOnWikiDomainConfig (T300407)]] (duration: 00m 52s) [20:13:19] sure [20:13:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:21] T300407: Allow managing upload-by-url allowlist as a system message - https://phabricator.wikimedia.org/T300407 [20:13:33] koi: your patch should be live [20:14:01] cjming I need some time, can you please continue with other patches in the meantime? I also need to log out [20:14:15] PeterBowman: sure - np [20:14:21] see you soon [20:14:29] Zabe: your next [20:14:37] *you're [20:14:46] ok [20:14:51] (03CR) 10Clare Ming: [C: 03+2] Start writing to cuc_actor in test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/797312 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [20:15:01] (03PS2) 10Clare Ming: Start writing to cuc_actor in test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/797312 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [20:16:32] cjming, you need to re +2 it, in order to kick the gate-and-submit job again since you rebased it after giving the +2 [20:16:51] (03CR) 10Clare Ming: [C: 03+2] Start writing to cuc_actor in test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/797312 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [20:16:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:47] 
!log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:17:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:17:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:09] (03Merged) 10jenkins-bot: Start writing to cuc_actor in test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/797312 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [20:18:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:18:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:44] Zabe: is your change testable? on mwdebug1001 [20:19:04] (03PS4) 10Peter Bowman: Add localized wordmark for plwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789613 (https://phabricator.wikimedia.org/T307683) [20:20:06] cjming, lgtm. It's only test wikis so making sure that editing doesn't fatal should be enough. 
[20:20:12] sounds good - syncing [20:20:39] (03PS5) 10Clare Ming: Add localized wordmark for plwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789613 (https://phabricator.wikimedia.org/T307683) (owner: 10Peter Bowman) [20:21:17] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:797312|Start writing to cuc_actor in test wikis (T233004)]] (duration: 00m 50s) [20:21:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:23] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [20:21:27] Zabe: your patch is live [20:21:35] thanks :) [20:21:52] (03CR) 10Clare Ming: [C: 03+2] Add localized wordmark for plwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789613 (https://phabricator.wikimedia.org/T307683) (owner: 10Peter Bowman) [20:22:01] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@aa49833]: increase memory_overhead for convert_to_esbulk [20:22:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:25] (03PS3) 10Zabe: Start writing to cuc_actor in s3, kcgwiki and labtestwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/797294 (https://phabricator.wikimedia.org/T233004) [20:23:29] (03Merged) 10jenkins-bot: Add localized wordmark for plwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789613 (https://phabricator.wikimedia.org/T307683) (owner: 10Peter Bowman) [20:23:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:23:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:25] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@aa49833]: increase memory_overhead for convert_to_esbulk (duration: 02m 24s) [20:24:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:37] PeterBowman: can you check mwdebug1001? 
[20:24:53] sorry, first time here, how can I access that? [20:25:23] I found instructions to ssh, but this is an interface change [20:25:55] PeterBowman: You need to use a browser extension to get your browser to read the production wikis using mwdebug1001 rather than a regular server. [20:26:02] PeterBowman: Don't worry about it, I can validate. [20:26:06] there's a browser extension WikimediaDebug that allows you to check changes on the server [20:26:24] thanks @James_F [20:26:40] oops, I'll remember that for the next time :| thank you James_F [20:26:43] cjming: And yes, it's working. [20:26:48] cool - syncing then [20:26:52] PeterBowman: No worries. It's all a bit too complicated, frankly. [20:27:50] !log cjming@deploy1002 Synchronized static/images/mobile/copyright/wiktionary-wordmark-pl.svg: Config: [[gerrit:789613|Add localized wordmark for plwiktionary (T307683)]] (duration: 00m 50s) [20:27:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:56] T307683: Add localized wordmark for plwiktionary - https://phabricator.wikimedia.org/T307683 [20:27:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:27:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:28:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:44] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:789613|Add localized wordmark for plwiktionary (T307683)]] (duration: 00m 51s) [20:28:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:51] PeterBowman: James_F: change should be live [20:28:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:29:01] yes I see it, thank you all! 
:) [20:29:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:57] James_F: go ahead with your patches -- can you let me know when you're done? i have a config change I want to do as well (not quite ready yet) [20:30:03] Sure! [20:30:11] (03PS3) 10Jforrester: Drop CodeReview, Part I: Stop loading it anywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/593350 (https://phabricator.wikimedia.org/T116948) [20:30:21] (03CR) 10Jforrester: [C: 03+2] "The time is nigh." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/593350 (https://phabricator.wikimedia.org/T116948) (owner: 10Jforrester) [20:30:32] DannyS712: if/when you're here, lmk and we can do your patch [20:31:02] 10SRE, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: Q3:(Need By: TBD) rack/setup/install frbackup2002 - https://phabricator.wikimedia.org/T306842 (10Jgreen) [20:32:05] (03Merged) 10jenkins-bot: Drop CodeReview, Part I: Stop loading it anywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/593350 (https://phabricator.wikimedia.org/T116948) (owner: 10Jforrester) [20:32:30] (03CR) 10Eevans: [C: 03+1] Enable cassandra encryption (aqs cluster) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/791663 (https://phabricator.wikimedia.org/T307798) (owner: 10Eevans) [20:33:41] 10SRE, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: Q3:(Need By: TBD) rack/setup/install frbackup2002 - https://phabricator.wikimedia.org/T306842 (10Jgreen) [20:34:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:34:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:04] 10SRE, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: Q3:(Need By: TBD) rack/setup/install frbackup2002 - https://phabricator.wikimedia.org/T306842 (10Jgreen) [20:34:12] !log jforrester@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:593350|Drop CodeReview, Part I: Stop loading it anywhere (T116948)]] 
(duration: 00m 51s) [20:34:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:18] T116948: Undeploy CodeReview - https://phabricator.wikimedia.org/T116948 [20:34:41] (03PS3) 10Jforrester: Drop CodeReview, Part II: Stop configuring it anywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/593351 (https://phabricator.wikimedia.org/T116948) [20:34:45] (03CR) 10Jforrester: [C: 03+2] Drop CodeReview, Part II: Stop configuring it anywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/593351 (https://phabricator.wikimedia.org/T116948) (owner: 10Jforrester) [20:34:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:34:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:35:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:35:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:13] (03Merged) 10jenkins-bot: Drop CodeReview, Part II: Stop configuring it anywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/593351 (https://phabricator.wikimedia.org/T116948) (owner: 10Jforrester) [20:37:04] (03PS3) 10Jforrester: Drop CodeReview, Part III: Drop from i18n build step [mediawiki-config] - 10https://gerrit.wikimedia.org/r/593352 (https://phabricator.wikimedia.org/T116948) [20:37:24] !log jforrester@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:593351|Drop CodeReview, Part II: Stop configuring it anywhere (T116948)]] (duration: 00m 51s) [20:37:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:34] (03CR) 10Jforrester: [C: 03+2] Drop CodeReview, Part III: Drop from i18n build step [mediawiki-config] - 
https://gerrit.wikimedia.org/r/593352 (https://phabricator.wikimedia.org/T116948) (owner: Jforrester) [20:39:05] (Merged) jenkins-bot: Drop CodeReview, Part III: Drop from i18n build step [mediawiki-config] - https://gerrit.wikimedia.org/r/593352 (https://phabricator.wikimedia.org/T116948) (owner: Jforrester) [20:40:12] !log jforrester@deploy1002 Synchronized wmf-config/extension-list: Config: [[gerrit:593352|Drop CodeReview, Part III: Drop from i18n build step (T116948)]] (duration: 00m 51s) [20:40:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:17] T116948: Undeploy CodeReview - https://phabricator.wikimedia.org/T116948 [20:40:36] cjming: OK, all done! [20:40:44] great - thanks [20:40:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:40:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:00] (CR) Clare Ming: [C: +2] Deploy TOC A/B test to frwiki, ptwiki at 50% [mediawiki-config] - https://gerrit.wikimedia.org/r/797424 (https://phabricator.wikimedia.org/T306607) (owner: Clare Ming) [20:41:06] (PS3) Clare Ming: Deploy TOC A/B test to frwiki, ptwiki at 50% [mediawiki-config] - https://gerrit.wikimedia.org/r/797424 (https://phabricator.wikimedia.org/T306607) [20:41:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:41:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:41:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:50] (CR) Clare Ming: [C: +2] Deploy TOC A/B test to frwiki, ptwiki at 50% [mediawiki-config] - https://gerrit.wikimedia.org/r/797424 (https://phabricator.wikimedia.org/T306607) (owner: Clare Ming) [20:42:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug:
apply [20:42:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:58] (Merged) jenkins-bot: Deploy TOC A/B test to frwiki, ptwiki at 50% [mediawiki-config] - https://gerrit.wikimedia.org/r/797424 (https://phabricator.wikimedia.org/T306607) (owner: Clare Ming) [20:44:12] SRE, ops-codfw, DC-Ops, fundraising-tech-ops: Q3:(Need By: TBD) rack/setup/install frbackup2002 - https://phabricator.wikimedia.org/T306842 (Jgreen) [20:46:06] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@2f7ddb1]: increase driver memory_overhead for convert_to_esbulk [20:46:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:31] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:797424|Deploy TOC A/B test to frwiki, ptwiki at 50% (T306607)]] (duration: 00m 52s) [20:47:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:35] T306607: Deploy ToC A/B test to remainder of desktop improvements pilot wikis - https://phabricator.wikimedia.org/T306607 [20:47:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:47:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:26] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@2f7ddb1]: increase driver memory_overhead for convert_to_esbulk (duration: 02m 20s) [20:48:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:48:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:48:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:03] !log end of UTC late backport window [20:49:07] Logged the
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:50:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:04] Reedy, sbassett, Maryum, and manfredi: Your horoscope predicts another unfortunate Weekly Security deployment window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220523T2100). [21:00:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:26] (CR) Dzahn: nrpe: manage sudo rules via nrpe::check (1 comment) [puppet] - https://gerrit.wikimedia.org/r/797422 (owner: Majavah) [21:03:26] (CR) Yahya: [C: +1] Remove patrol rights from autoconfirmed users and create patroller user group on bnwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/793790 (https://phabricator.wikimedia.org/T308945) (owner: MdsShakil) [21:03:30] SRE, Infrastructure-Foundations, Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (Ejegg) [21:03:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1182.eqiad.wmnet with reason: Maintenance [21:03:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1182.eqiad.wmnet with reason: Maintenance [21:03:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T298555)', diff saved to https://phabricator.wikimedia.org/P28367 and previous config saved to /var/cache/conftool/dbconfig/20220523-210339-ladsgroup.json [21:03:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:44] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:45] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [21:04:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:04:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:04:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:07:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:59] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [21:23:12] (CR) Cwhite: [C: +1] Set fixed uid/gid for kafka by default [puppet] - https://gerrit.wikimedia.org/r/797127 (https://phabricator.wikimedia.org/T296982) (owner: Elukey) [21:23:48] https://phabricator.wikimedia.org 503ing for me [21:23:49] 503 [21:23:54] gah you beat me by like a sec [21:24:02] for wikis as well [21:24:04] 503 Service Unavailable :P [21:24:08] (CR) Cwhite: [C: +1] alerts: take rule file site into consideration when deploying [puppet] - https://gerrit.wikimedia.org/r/797237 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi) [21:24:11] yeah, mw.org is down for me [21:24:16] that's a #page [21:24:26] thanks, looking [21:24:47] thanks, looking [21:24:49] it's fine I didn't want to look at phab anyway /s [21:24:52] phab is up for me (oregon) [21:24:57] :( [21:25:03] phab is working.
here [21:25:11] not for me (europe) [21:25:14] (CR) Cwhite: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/797201 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi) [21:25:17] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:25:17] Not for me UK [21:25:19] (ProbeDown) firing: (9) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:25:19] (ProbeDown) firing: (22) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:25:27] jouncebot: now [21:25:27] For the next 1 hour(s) and 34 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220523T2100) [21:25:29] jinxer-wm: 2slow4me [21:25:35] not for me, Eastern U.S. Nor enwiki Main Page [21:25:37] is this a deployment ?
^ [21:25:47] enwiki is down for me, ticket.wm too [21:26:01] PROBLEM - Apache HTTP on mw1368 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [21:26:18] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki appserver at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:26:19] PROBLEM - Apache HTTP on mw1370 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [21:26:19] PROBLEM - Apache HTTP on mw1364 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [21:26:19] PROBLEM - Apache HTTP on mw1365 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [21:26:19] PROBLEM - Apache HTTP on mw1373 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [21:26:21] PROBLEM - Apache HTTP on mw1372 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [21:26:31] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [21:26:37] PROBLEM - Apache HTTP on mw1350 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [21:27:11] PROBLEM - Apache HTTP on mw1349 
is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [21:27:13] PROBLEM - Apache HTTP on mw1353 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [21:27:15] PROBLEM - Apache HTTP on mw1354 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [21:27:17] PROBLEM - Apache HTTP on mw1366 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [21:27:19] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 18.08 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [21:27:37] PROBLEM - LVS text-https esams port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 233 bytes in 0.329 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [21:27:40] PROBLEM - wiki content on commons #page on commons.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - string Picture of the day not found on https://commons.wikimedia.org:443/wiki/Main_Page - 233 bytes in 0.005 second response time https://phabricator.wikimedia.org/project/view/1118/ [21:27:41] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [21:27:42] PROBLEM - LVS text-https eqsin port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.eqsin.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [21:27:55] PROBLEM - Apache HTTP on mw1352 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [21:27:55] PROBLEM - Apache HTTP on mw1355 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [21:27:59] PROBLEM - PyBal backends health check on lvs1017 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp1081.eqiad.wmnet, cp1083.eqiad.wmnet, cp1085.eqiad.wmnet, cp1075.eqiad.wmnet, cp1079.eqiad.wmnet, cp1089.eqiad.wmnet, cp1077.eqiad.wmnet are marked down but pooled: textlb_443: Servers cp1081.eqiad.wmnet, cp1083.eqiad.wmnet, cp1085.eqiad.wmnet, cp1087.eqiad.wmnet, cp1075.eqiad.wmnet, cp1079.eqiad.wmnet, cp1089.eqiad.wmnet, cp107 [21:27:59] wmnet are marked down but pooled: testlb6_443: Servers cp1081.eqiad.wmnet, cp1083.eqiad.wmnet, cp1087.eqiad.wmnet, cp1079.eqiad.wmnet, cp1089.eqiad.wmnet, cp1077.eqiad.wmnet are marked down but pooled: textlb6_443: Servers cp1083.eqiad.wmnet, cp1085.eqiad.wmnet, cp1075.eqiad.wmnet, cp1079.eqiad.wmnet, cp1089.eqiad.wmnet, cp1077.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [21:28:00] PROBLEM - https://phabricator.wikimedia.org #page on phabricator.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Phabricator [21:28:01] PROBLEM - LVS text-https ulsfo port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [21:28:02] PROBLEM - Apache HTTP on mw1367 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [21:28:02] PROBLEM - Apache HTTP on mw1371 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers [21:28:02] PROBLEM - PyBal backends health check on lvs5001 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp5009.eqsin.wmnet, cp5011.eqsin.wmnet, cp5008.eqsin.wmnet, cp5015.eqsin.wmnet, cp5012.eqsin.wmnet, cp5010.eqsin.wmnet, cp5007.eqsin.wmnet are marked down but pooled: textlb_443: Servers cp5009.eqsin.wmnet, cp5011.eqsin.wmnet, cp5016.eqsin.wmnet, cp5008.eqsin.wmnet, cp5015.eqsin.wmnet, cp5012.eqsin.wmnet, cp5007.eqsin.wmnet, cp501 [21:28:02] wmnet are marked down but pooled: testlb6_443: Servers cp5009.eqsin.wmnet, cp5011.eqsin.wmnet, cp5016.eqsin.wmnet, cp5008.eqsin.wmnet, cp5015.eqsin.wmnet, cp5012.eqsin.wmnet, cp5007.eqsin.wmnet are marked down but pooled: textlb6_443: Servers cp5009.eqsin.wmnet, cp5011.eqsin.wmnet, cp5016.eqsin.wmnet, cp5015.eqsin.wmnet, cp5012.eqsin.wmnet, cp5007.eqsin.wmnet, cp5010.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wi [21:28:05] don't think it's a deployment, but can't check SAL [21:28:20] PROBLEM - LVS text-https drmrs port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.drmrs.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [21:28:23] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 6.209 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [21:28:23] Down here in UK [21:28:25] PROBLEM - Apache HTTP on mw1369 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [21:28:30] PROBLEM - LVS text-https codfw port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.codfw.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [21:28:33] PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [21:28:34] PROBLEM - LVS text-https drmrs port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.drmrs.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [21:28:37] PROBLEM - Apache HTTP on mw1384 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [21:28:44] PROBLEM - LVS text-https codfw port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.codfw.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [21:28:58] PROBLEM - LVS text-https ulsfo port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.ulsfo.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [21:28:58] PROBLEM - Debmonitor Health Check on debmonitor.wikimedia.org is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable https://wikitech.wikimedia.org/wiki/Debmonitor [21:28:59] PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} 
(Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton [21:29:06] PROBLEM - LVS text-https eqsin port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.eqsin.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [21:29:09] PROBLEM - Apache HTTP on mw1351 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [21:29:09] PROBLEM - PyBal backends health check on lvs3005 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp3060.esams.wmnet, cp3050.esams.wmnet, cp3062.esams.wmnet, cp3064.esams.wmnet, cp3058.esams.wmnet, cp3052.esams.wmnet, cp3056.esams.wmnet are marked down but pooled: textlb_443: Servers cp3060.esams.wmnet, cp3050.esams.wmnet, cp3054.esams.wmnet, cp3062.esams.wmnet, cp3064.esams.wmnet, cp3058.esams.wmnet, cp3052.esams.wmnet, cp305 [21:29:09] wmnet are marked down but pooled: testlb6_443: Servers cp3060.esams.wmnet, cp3050.esams.wmnet, cp3058.esams.wmnet, cp3062.esams.wmnet, cp3052.esams.wmnet, cp3056.esams.wmnet are marked down but pooled: textlb6_443: Servers cp3060.esams.wmnet, cp3050.esams.wmnet, cp3054.esams.wmnet, cp3062.esams.wmnet, cp3058.esams.wmnet, cp3052.esams.wmnet, cp3056.esams.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [21:29:14] PROBLEM - LVS text-https eqiad port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [21:29:14] PROBLEM - PyBal backends health check on lvs5003 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp5009.eqsin.wmnet, cp5012.eqsin.wmnet, cp5007.eqsin.wmnet, cp5011.eqsin.wmnet, 
cp5008.eqsin.wmnet, cp5015.eqsin.wmnet, cp5010.eqsin.wmnet are marked down but pooled: textlb_443: Servers cp5009.eqsin.wmnet, cp5011.eqsin.wmnet, cp5016.eqsin.wmnet, cp5008.eqsin.wmnet, cp5015.eqsin.wmnet, cp5010.eqsin.wmnet, cp5007.eqsin.wmnet are ma [21:29:14] n but pooled: testlb6_443: Servers cp5009.eqsin.wmnet, cp5011.eqsin.wmnet, cp5016.eqsin.wmnet, cp5008.eqsin.wmnet, cp5015.eqsin.wmnet, cp5007.eqsin.wmnet are marked down but pooled: textlb6_443: Servers cp5009.eqsin.wmnet, cp5007.eqsin.wmnet, cp5011.eqsin.wmnet, cp5008.eqsin.wmnet, cp5015.eqsin.wmnet, cp5010.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [21:29:15] PROBLEM - LVS text-https eqiad port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.eqiad.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 233 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [21:29:22] PROBLEM - LVS text-https esams port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [21:29:35] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is CRITICAL: 0.9722 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [21:29:37] PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton [21:29:43] PROBLEM - PHP7 rendering 
on mw1371 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:29:43] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw1367.eqiad.wmnet, mw1414.eqiad.wmnet, mw1332.eqiad.wmnet, mw1371.eqiad.wmnet, mw1455.eqiad.wmnet, mw1442.eqiad.wmnet, mw1395.eqiad.wmnet, mw1434.eqiad.wmnet, mw1322.eqiad.wmnet, mw1355.eqiad.wmnet, mw1323.eqiad.wmnet, mw1384.eqiad.wmnet, mw1454.eqiad.wmnet, mw1327.eqiad.wmnet, mw1328.eqiad.wmnet, mw1413.eqiad.wmnet, mw [21:29:43] ad.wmnet, mw1393.eqiad.wmnet, mw1351.eqiad.wmnet, mw1416.eqiad.wmnet, mw1405.eqiad.wmnet, mw1352.eqiad.wmnet, mw1432.eqiad.wmnet, mw1441.eqiad.wmnet, mw1333.eqiad.wmnet, mw1326.eqiad.wmnet, mw1435.eqiad.wmnet, mw1420.eqiad.wmnet, mw1418.eqiad.wmnet, mw1319.eqiad.wmnet, mw1407.eqiad.wmnet, mw1366.eqiad.wmnet, mw1324.eqiad.wmnet, mw1372.eqiad.wmnet, mw1391.eqiad.wmnet, mw1370.eqiad.wmnet, mw1429.eqiad.wmnet, mw1451.eqiad.wmnet, mw1331.eqiad [21:29:43] mw1401.eqiad.wmnet, mw1397.eqiad.wmnet, mw1325.eqiad.wmnet, mw1403.eqiad.wmnet, mw1373.eqiad.wmnet, mw1385.eqiad.wmnet, mw1369.eqiad.wmnet, mw1419.eqiad.wmnet, mw1387.eqiad.wmnet, mw135 https://wikitech.wikimedia.org/wiki/PyBal [21:29:47] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw1433.eqiad.wmnet, mw1365.eqiad.wmnet, mw1419.eqiad.wmnet, mw1442.eqiad.wmnet, mw1434.eqiad.wmnet, mw1366.eqiad.wmnet, mw1432.eqiad.wmnet, mw1349.eqiad.wmnet, mw1384.eqiad.wmnet, mw1387.eqiad.wmnet, mw1430.eqiad.wmnet, mw1416.eqiad.wmnet, mw1405.eqiad.wmnet, mw1329.eqiad.wmnet, mw1320.eqiad.wmnet, mw1399.eqiad.wmnet, mw [21:29:47] ad.wmnet, mw1420.eqiad.wmnet, mw1333.eqiad.wmnet, mw1393.eqiad.wmnet, mw1454.eqiad.wmnet, mw1372.eqiad.wmnet, mw1370.eqiad.wmnet, mw1389.eqiad.wmnet, mw1395.eqiad.wmnet, mw1397.eqiad.wmnet, mw1325.eqiad.wmnet, mw1385.eqiad.wmnet, 
mw1417.eqiad.wmnet, mw1367.eqiad.wmnet, mw1373.eqiad.wmnet, mw1455.eqiad.wmnet, mw1436.eqiad.wmnet, mw1332.eqiad.wmnet, mw1452.eqiad.wmnet, mw1414.eqiad.wmnet, mw1369.eqiad.wmnet, mw1371.eqiad.wmnet, mw1453.eqiad [21:29:47] mw1322.eqiad.wmnet, mw1319.eqiad.wmnet, mw1323.eqiad.wmnet, mw1327.eqiad.wmnet, mw1413.eqiad.wmnet, mw1456.eqiad.wmnet, mw1351.eqiad.wmnet, mw1391.eqiad.wmnet, mw1352.eqiad.wmnet, mw144 https://wikitech.wikimedia.org/wiki/PyBal [21:30:07] PROBLEM - PHP7 rendering on mw1370 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:30:12] RECOVERY - LVS text-https esams port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 18860 bytes in 1.082 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [21:30:13] RECOVERY - wiki content on commons #page on commons.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 182039 bytes in 0.012 second response time https://phabricator.wikimedia.org/project/view/1118/ [21:30:13] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [21:30:19] RECOVERY - Apache HTTP on mw1352 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:30:21] RECOVERY - Apache HTTP on mw1355 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 1.457 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:30:22] RECOVERY - https://phabricator.wikimedia.org #page on phabricator.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 39622 bytes in 0.133 second response time https://wikitech.wikimedia.org/wiki/Phabricator [21:30:23] RECOVERY - LVS text-https ulsfo port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on 
text-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 18873 bytes in 0.589 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [21:30:25] RECOVERY - Apache HTTP on mw1371 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:30:25] RECOVERY - Apache HTTP on mw1367 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.125 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:30:33] RECOVERY - PyBal backends health check on lvs1017 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:30:44] RECOVERY - LVS text-https drmrs port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.drmrs.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 18872 bytes in 0.545 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [21:30:49] RECOVERY - Apache HTTP on mw1369 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:30:51] RECOVERY - Apache HTTP on mw1368 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:30:52] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [21:30:54] RECOVERY - LVS text-https codfw port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.codfw.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 18872 bytes in 0.293 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [21:31:00] (Wikidata Reliability Metrics - Median Payload alert) firing: Wikidata 
Reliability Metrics - Median Payload alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+Payload+alert [21:31:00] RECOVERY - LVS text-https drmrs port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.drmrs.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 18859 bytes in 0.523 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [21:31:03] RECOVERY - Apache HTTP on mw1384 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:31:08] RECOVERY - LVS text-https codfw port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.codfw.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 18860 bytes in 0.310 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [21:31:13] RECOVERY - Apache HTTP on mw1370 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:31:15] RECOVERY - Apache HTTP on mw1373 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:31:15] RECOVERY - Apache HTTP on mw1364 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:31:15] RECOVERY - Apache HTTP on mw1365 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:31:15] RECOVERY - Apache HTTP on mw1372 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:31:22] RECOVERY - LVS text-https ulsfo port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 
200 OK - 18860 bytes in 0.472 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [21:31:24] (JobUnavailable) firing: (3) Reduced availability for job swagger_check_restbase_eqiad in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:31:29] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [21:31:29] RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [21:31:33] RECOVERY - Debmonitor Health Check on debmonitor.wikimedia.org is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 1634 bytes in 0.091 second response time https://wikitech.wikimedia.org/wiki/Debmonitor [21:31:35] RECOVERY - Apache HTTP on mw1350 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.056 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:31:35] RECOVERY - Apache HTTP on mw1351 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.059 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:31:40] RECOVERY - LVS text-https eqiad port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 18872 bytes in 0.105 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [21:31:45] RECOVERY - PyBal backends health check on lvs3005 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:31:50] RECOVERY - LVS text-https esams port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org 
-nginx- IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 18884 bytes in 0.536 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [21:31:51] RECOVERY - LVS text-https eqiad port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 18859 bytes in 0.107 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [21:32:07] RECOVERY - proton LVS eqiad on proton.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [21:32:09] RECOVERY - PHP7 rendering on mw1371 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:32:09] RECOVERY - Apache HTTP on mw1349 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.054 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:32:11] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.06944 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [21:32:12] (ProbeDown) firing: (23) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:32:13] RECOVERY - Apache HTTP on mw1353 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:32:13] back for me [21:32:13] RECOVERY - Apache HTTP on mw1354 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.043 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers [21:32:15] RECOVERY - Apache HTTP on mw1366 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:32:17] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:32:21] yep i'm fine [21:32:21] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:32:27] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: (C)60 le (W)70 le 72.07 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [21:32:31] RECOVERY - PHP7 rendering on mw1370 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:33:21] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 95.86 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [21:33:29] RECOVERY - High average POST latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [21:33:51] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5007 is CRITICAL: 4.451e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5007 [21:33:55] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [21:33:55] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5011 is CRITICAL: 4.337e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5011 [21:34:05] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5014 is CRITICAL: 4.6e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5014 [21:34:11] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5009 is CRITICAL: 4.627e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5009 [21:35:04] RECOVERY - LVS text-https eqsin port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.eqsin.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 18872 bytes in 
6.163 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [21:35:09] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5008 is CRITICAL: 5.23e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5008 [21:35:11] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:35:25] RECOVERY - PyBal backends health check on lvs5001 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:35:47] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) resolved: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [21:36:16] RECOVERY - LVS text-https eqsin port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.eqsin.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 18859 bytes in 1.306 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [21:36:20] (JobUnavailable) firing: (4) Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:36:31] RECOVERY - PyBal backends health check on lvs5003 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:36:57] (ProbeDown) firing: (23) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http 
- https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:37:02] (ProbeDown) firing: (22) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:37:30] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki appserver at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:37:31] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5008 is OK: (C)5000 gt (W)3000 gt 320.5 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5008 [21:38:31] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5007 is OK: (C)5000 gt (W)3000 gt 408.9 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5007 [21:38:35] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5011 is OK: (C)5000 gt (W)3000 gt 386.7 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5011 [21:38:43] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5014 is OK: (C)5000 gt (W)3000 gt 293.7 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5014 [21:38:49] RECOVERY - Time elapsed since the last kafka event processed by purged on 
cp5009 is OK: (C)5000 gt (W)3000 gt 379.3 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5009 [21:38:55] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:39:45] (JobUnavailable) resolved: (4) Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:40:19] (ProbeDown) resolved: (23) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:40:19] (ProbeDown) resolved: (21) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:40:36] what just happened? 
[21:40:39] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host relforge1003.eqiad.wmnet with OS bullseye [21:40:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:10] Amir1: a few minutes of downtime :P [21:41:50] addshore: according to alert it was only one minute :D [21:42:05] Interesting, looks like that makes https://www.wikimediastatus.net automatically add an "Errors for many users" incident [21:43:14] it was definitely more than a minute [21:43:18] !log [cumin1001:~] $ sudo systemctl start httpbb_hourly_appserver [21:43:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:28] zabe: about 8 or so [21:44:25] mutante: so it was `PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service` ? [21:45:31] * addshore goes back to what he was doing [21:45:43] TheresNoTime: that is failing for an unrelated reason [21:45:51] ah (: [21:46:02] * TheresNoTime has forgotten what they were doing now [21:46:23] PROBLEM - ElasticSearch health check for shards on 9400 on relforge1004 is CRITICAL: CRITICAL - elasticsearch http://localhost:9400/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9400): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:46:29] TheresNoTime: it's because https://www.mediawiki.org/w/index.php?title=Special:CodeReview&path=foo is 404 and not 302 (those redirects for CodeReview) [21:47:57] PROBLEM - ElasticSearch health check for shards on 9200 on relforge1004 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:48:03] ACKNOWLEDGEMENT - ElasticSearch health check for shards on 9200 on relforge1004 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) Ryan Kemper T308770 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:48:03] ACKNOWLEDGEMENT - ElasticSearch health check for shards on 9400 on relforge1004 is CRITICAL: CRITICAL - elasticsearch http://localhost:9400/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9400): Read timed out. (read timeout=4) Ryan Kemper T308770 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:48:04] as in https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/593352/3/wmf-config/extension-list ? [21:48:11] (CR) Cwhite: [C: +2] logstash: set dlq output and template_version [puppet] - https://gerrit.wikimedia.org/r/777888 (https://phabricator.wikimedia.org/T305088) (owner: Cwhite) [21:49:32] (Wikidata Reliability Metrics - Median Payload alert) resolved: Wikidata Reliability Metrics - Median Payload alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+Payload+alert [21:50:46] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on relforge1003.eqiad.wmnet with reason: host reimage [21:50:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:53] TheresNoTime: sounds likely. but whatever it says on https://phabricator.wikimedia.org/T205361 afaict [21:54:35] legoktm: is it expected that Special:Code is gone? 
[21:54:40] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on relforge1003.eqiad.wmnet with reason: host reimage [21:54:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:57] mutante: yes, James_F undeployed it earlier. Some but not all of the redirects are in place [21:55:29] e.g. https://gerrit.wikimedia.org/r/c/operations/puppet/+/774943/ [21:55:30] legoktm: it triggered alerts because we did not remove the tests before actually undeploying it. I will fix that now though [21:55:34] it does redirect on mediawiki, but not e.g., enwiki [21:55:37] thanks! [21:55:50] perryprog: Special:Code never existed on any other wiki besides mw.o [21:55:58] 🤦‍♂️ ah [21:56:11] mutante: you can tag any patches with T116948 [21:56:11] T116948: Undeploy CodeReview - https://phabricator.wikimedia.org/T116948 [21:56:51] one rule is about Special:Code but others are about Special:CodeReview [22:01:54] (PS1) Dzahn: httpbb: remove tests for undeployed CodeReview extension [puppet] - https://gerrit.wikimedia.org/r/797533 (https://phabricator.wikimedia.org/T116948) [22:02:32] I can access English atm [22:05:42] The redirects from mw.o/Special:Code to static-codereview.wikimedia.org should still work, so when the tests are alerting that means that the tests are not completely correct or that they depend on https://gerrit.wikimedia.org/r/c/operations/puppet/+/774943 or some other fix [22:07:55] RECOVERY - ElasticSearch health check for shards on 9400 on relforge1004 is OK: OK - elasticsearch status relforge-eqiad-small-alpha: relocating_shards: 0, status: green, number_of_nodes: 2, cluster_name: relforge-eqiad-small-alpha, delayed_unassigned_shards: 0, initializing_shards: 0, timed_out: False, active_shards_percent_as_number: 100.0, unassigned_shards: 0, active_primary_shards: 37, task_max_waiting_in_queue_millis: 0, number_of_p [22:07:55] asks: 0, active_shards: 42, number_of_in_flight_fetch: 0, 
number_of_data_nodes: 2 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:10:33] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [22:10:39] zabe: 2 of them failed. and they are: [22:10:56] Status code: expected 302, got 404. - https://www.mediawiki.org/w/index.php?title=Special:CodeReview&path=foo [22:11:17] Status code: expected 302, got 404. - https://www.mediawiki.org/w/index.php?title=Special:Code&path=foo [22:11:43] zabe: amending https://gerrit.wikimedia.org/r/c/operations/puppet/+/797533/1/modules/profile/files/httpbb/appserver/test_main.yaml [22:12:45] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host relforge1003.eqiad.wmnet with OS bullseye [22:12:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:46] zabe: or not. just adding reviewers.. heh [22:14:31] (CR) Jcrespo: [C: -1] "I believe the static version stays, only the mw extension has to be removed?- but someone else here should confirm." [puppet] - https://gerrit.wikimedia.org/r/797533 (https://phabricator.wikimedia.org/T116948) (owner: Dzahn) [22:16:25] what would you say about only removing the two failing ones for now? It seems like the rewrite rules need some tweaking in order to work for cases as well. [22:16:28] (CR) Jcrespo: [C: -1] httpbb: remove tests for undeployed CodeReview extension (1 comment) [puppet] - https://gerrit.wikimedia.org/r/797533 (https://phabricator.wikimedia.org/T116948) (owner: Dzahn) [22:17:41] BTW, what was the cause of the earlier outage, since the httpbb failures were unrelated to that? I was looking for follow-up on it but didn't see any. 
[22:19:16] (PS4) Zabe: Start writing to cuc_actor in s3, kcgwiki and labtestwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/797294 (https://phabricator.wikimedia.org/T233004) [22:24:54] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:25:50] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.282 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:35:44] (PS1) Jdlrobson: mediawiki.skinning: `transition-duration` accessibility override set to `0` [core] (wmf/1.39.0-wmf.12) - https://gerrit.wikimedia.org/r/797219 (https://phabricator.wikimedia.org/T308979) [22:37:30] SRE, DBA, Platform Engineering, Sustainability (Incident Followup): Improve slow read query handling - https://phabricator.wikimedia.org/T293530 (aaron) Note that MYSQLI_OPT_READ_TIMEOUT can only be set once per https://bugs.php.net/bug.php?id=76703 [22:39:30] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:41:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1179.eqiad.wmnet with reason: Maintenance [22:41:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1179.eqiad.wmnet with reason: Maintenance [22:41:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:41:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T298555)', diff saved to https://phabricator.wikimedia.org/P28368 and previous config saved to /var/cache/conftool/dbconfig/20220523-224119-ladsgroup.json [22:41:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:41:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:41:27] T298555: Fix 
mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [22:53:13] (KubernetesRsyslogDown) firing: (5) rsyslog on kubernetes1014:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:06:42] (CR) Jforrester: "Hmm. These were meant to have been adjusted so they wouldn't alert when the extension was undeployed, because they were asserting that the" [puppet] - https://gerrit.wikimedia.org/r/797533 (https://phabricator.wikimedia.org/T116948) (owner: Dzahn) [23:08:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T298555)', diff saved to https://phabricator.wikimedia.org/P28369 and previous config saved to /var/cache/conftool/dbconfig/20220523-230851-ladsgroup.json [23:08:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:08:58] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [23:15:53] (PS2) Dzahn: httpbb: remove tests for undeployed CodeReview extension [puppet] - https://gerrit.wikimedia.org/r/797533 (https://phabricator.wikimedia.org/T116948) [23:16:27] (CR) Dzahn: "amended. now only removing what _actually_ fails currently. that was line 78 and line 82." 
[puppet] - https://gerrit.wikimedia.org/r/797533 (https://phabricator.wikimedia.org/T116948) (owner: Dzahn) [23:16:38] (CR) Jforrester: [C: +1] httpbb: remove tests for undeployed CodeReview extension [puppet] - https://gerrit.wikimedia.org/r/797533 (https://phabricator.wikimedia.org/T116948) (owner: Dzahn) [23:17:04] (CR) Dzahn: httpbb: remove tests for undeployed CodeReview extension (1 comment) [puppet] - https://gerrit.wikimedia.org/r/797533 (https://phabricator.wikimedia.org/T116948) (owner: Dzahn) [23:17:24] (CR) Dzahn: [C: +2] httpbb: remove tests for undeployed CodeReview extension [puppet] - https://gerrit.wikimedia.org/r/797533 (https://phabricator.wikimedia.org/T116948) (owner: Dzahn) [23:18:36] (CR) Dzahn: [V: +2 C: +2] httpbb: remove tests for undeployed CodeReview extension [puppet] - https://gerrit.wikimedia.org/r/797533 (https://phabricator.wikimedia.org/T116948) (owner: Dzahn) [23:20:00] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:20:46] !log cumin1001 - systemctl start httpbb_hourly_appserver after deploying gerrit:797533 leads to '+icinga-wm> RECOVERY - Check systemd state on cumin1001 is OK: OK" T116948 [23:20:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:20:58] T116948: Undeploy CodeReview - https://phabricator.wikimedia.org/T116948 [23:21:19] (CR) Dzahn: [V: +2 C: +2] "manually started: <+icinga-wm> RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wi" [puppet] - https://gerrit.wikimedia.org/r/797533 (https://phabricator.wikimedia.org/T116948) (owner: Dzahn) [23:22:40] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers 
[23:23:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P28370 and previous config saved to /var/cache/conftool/dbconfig/20220523-232357-ladsgroup.json [23:24:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:31:43] (CR) Krinkle: Add "db-mainstash" entry to $wgObjectCaches (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/752807 (https://phabricator.wikimedia.org/T212129) (owner: Aaron Schulz) [23:32:06] RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:39:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P28371 and previous config saved to /var/cache/conftool/dbconfig/20220523-233902-ladsgroup.json [23:39:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:44:31] (CR) Krinkle: [C: -1] "blocked on https://gerrit.wikimedia.org/r/c/mediawiki/core/+/789725 as indeed currently a missing localLB is "fixed" by service wiring via" [mediawiki-config] - https://gerrit.wikimedia.org/r/752807 (https://phabricator.wikimedia.org/T212129) (owner: Aaron Schulz) [23:47:38] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@02f2375]: increase driver jvm heap for convert_to_esbulk [23:47:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:49:56] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@02f2375]: increase driver jvm heap for convert_to_esbulk (duration: 02m 18s) [23:49:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:54:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T298555)', diff saved to https://phabricator.wikimedia.org/P28372 and 
previous config saved to /var/cache/conftool/dbconfig/20220523-235407-ladsgroup.json [23:54:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1129.eqiad.wmnet with reason: Maintenance [23:54:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1129.eqiad.wmnet with reason: Maintenance [23:54:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:54:13] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [23:54:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T298555)', diff saved to https://phabricator.wikimedia.org/P28373 and previous config saved to /var/cache/conftool/dbconfig/20220523-235415-ladsgroup.json [23:54:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:54:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:54:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:56:11] SRE, Traffic, Wikimedia-Incident: All wikis down: error 503 (resolved, follow-up pending) - https://phabricator.wikimedia.org/T308940 (Dzahn) Open→Resolved a:Dzahn https://wikitech.wikimedia.org/wiki/Incidents/2022-05-21_-_varnish_cache_busting [23:56:22] PROBLEM - ElasticSearch health check for shards on 9200 on relforge1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 96 threshold =0.15 breach: cluster_name: relforge-eqiad, status: yellow, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 170, active_shards: 211, relocating_shards: 0, initializing_shards: 2, unassigned_shards: 94, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, numbe [23:56:22] flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 68.72964169381108 https://wikitech.wikimedia.org/wiki/Search%23Administration [23:59:16] I repeatedly 
receive email notifications from Gerrit (V+2, CR+2 etc.) about an already merged patch.. is this some problem on my side? [23:59:29] *several