[00:02:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T298555)', diff saved to https://phabricator.wikimedia.org/P28246 and previous config saved to /var/cache/conftool/dbconfig/20220522-000225-ladsgroup.json [00:02:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [00:02:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [00:02:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:02:31] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [00:02:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:02:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:06:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P28247 and previous config saved to /var/cache/conftool/dbconfig/20220522-000607-ladsgroup.json [00:06:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:21:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T298560)', diff saved to https://phabricator.wikimedia.org/P28248 and previous config saved to /var/cache/conftool/dbconfig/20220522-002112-ladsgroup.json [00:21:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on db1146.eqiad.wmnet with reason: Maintenance [00:21:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db1146.eqiad.wmnet with reason: Maintenance [00:21:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:21:17] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - 
https://phabricator.wikimedia.org/T298560 [00:21:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:21:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T298560)', diff saved to https://phabricator.wikimedia.org/P28249 and previous config saved to /var/cache/conftool/dbconfig/20220522-002120-ladsgroup.json [00:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:21:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:33:07] PROBLEM - Disk space on an-master1002 is CRITICAL: DISK CRITICAL - free space: /srv 5938 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-master1002&var-datasource=eqiad+prometheus/ops [00:36:43] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_legacy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:00:03] RECOVERY - Check systemd state on gitlab1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:06:51] PROBLEM - Check systemd state on gitlab1001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-config-backup-gitlab1003.wikimedia.org.service,rsync-data-backup-gitlab1003.wikimedia.org.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:07:33] PROBLEM - Check systemd state on ms-be1064 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:13:21] PROBLEM - SSH on druid1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:15:41] RECOVERY - Disk space on an-master1002 is OK: DISK OK 
https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-master1002&var-datasource=eqiad+prometheus/ops [01:24:38] DannyS712: FYI T308927 T308943 [01:24:38] T308927: quibble-vendor-mysql-php72-selenium-docker: "cannot create directory ‘log’: Permission denied" - https://phabricator.wikimedia.org/T308927 [01:24:39] T308943: CI fails with 'This change or one of its cross-repo dependencies was unable to be automatically merged' for a lot of repos - https://phabricator.wikimedia.org/T308943 [01:26:17] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:40:45] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:50:45] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:02:09] RECOVERY - Check systemd state on ms-be1064 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:05:05] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:13:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1145.eqiad.wmnet with reason: Maintenance [02:13:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1145.eqiad.wmnet with reason: Maintenance [02:13:40] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:13:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:14:33] RECOVERY - SSH on druid1006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:01:41] PROBLEM - Gerrit Health Check SSL Expiry on gerrit.wikimedia.org is CRITICAL: CRITICAL - Certificate gerrit.wikimedia.org expires in 6 day(s) (Sat 28 May 2022 08:33:22 PM GMT +0000). https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [03:03:57] RECOVERY - Gerrit Health Check SSL Expiry on gerrit.wikimedia.org is OK: OK - Certificate gerrit.wikimedia.org will expire on Wed 27 Jul 2022 08:27:52 PM GMT +0000. https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [03:06:15] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:28:39] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:22:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1142.eqiad.wmnet with reason: Maintenance [04:22:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1142.eqiad.wmnet with reason: Maintenance [04:22:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:22:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1142 (T298555)', diff saved to https://phabricator.wikimedia.org/P28250 and previous config saved to /var/cache/conftool/dbconfig/20220522-042249-ladsgroup.json [04:22:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:22:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:22:56] T298555: Fix mismatching field type of logging.log_timestamp on wmf 
wikis - https://phabricator.wikimedia.org/T298555 [04:57:32] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:20:14] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:36:52] PROBLEM - MariaDB Replica Lag: s7 #page on db1127 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1361.18 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:37:19] mmmm [05:37:29] schema change? [05:37:35] checking [05:38:33] 👋 [05:38:37] quite a day [05:39:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1127', diff saved to https://phabricator.wikimedia.org/P28251 and previous config saved to /var/cache/conftool/dbconfig/20220522-053905-marostegui.json [05:39:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:39:13] Depooled just in case [05:40:00] Yes, it was a schema change [05:40:15] The host came out from downtime earlier than expected [05:40:24] Amir1: please adjust the downtime, this can page again [05:40:25] ahh that'll do it [05:40:39] need anything, or are you all set? [05:40:48] rzl: it is ok, you can go back to your life [05:40:55] thanks for showing up [05:40:57] thanks <3 have a good morning!
[05:43:24] PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:43:43] I am going to leave the host depooled as I don't have time to wait for it to catch up and then repool it; I might do it later today or tomorrow during work hours [05:43:47] Amir1: ^ [05:44:42] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:47:03] (PS1) Marostegui: db1127: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/794808 [05:48:04] (CR) Marostegui: "recheck" [puppet] - https://gerrit.wikimedia.org/r/794808 (owner: Marostegui) [05:48:25] (CR) Marostegui: [C: +2] db1127: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/794808 (owner: Marostegui) [05:49:14] I am going back to my life [05:51:00] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:52:30] mmm I think it wasn't the schema change but some sort of storage problem, the RAID has many errors [05:54:45] Created this: https://phabricator.wikimedia.org/T308965 [06:11:08] SRE, Bengali-Sites, User-Urbanecm, Wiki-Setup (Create): Create a new wiki for Wikimedia Bangladesh - https://phabricator.wikimedia.org/T33096 (Ahmad_Kanik) [06:29:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance elastic1080-production-search-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [06:36:55] (PS1) KartikMistry: Update
cxserver to 2022-05-22-062659-production [deployment-charts] - https://gerrit.wikimedia.org/r/794890 (https://phabricator.wikimedia.org/T290847) [06:42:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T298555)', diff saved to https://phabricator.wikimedia.org/P28252 and previous config saved to /var/cache/conftool/dbconfig/20220522-064232-ladsgroup.json [06:42:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1143.eqiad.wmnet with reason: Maintenance [06:42:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1143.eqiad.wmnet with reason: Maintenance [06:42:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:42:40] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [06:42:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1143 (T298555)', diff saved to https://phabricator.wikimedia.org/P28253 and previous config saved to /var/cache/conftool/dbconfig/20220522-064240-ladsgroup.json [06:42:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:42:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:42:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220522T0700) [07:02:31] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:08:23] marostegui: morning. Let me see.
I made the downtime to 16 hours last time [07:09:01] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance elastic1080-production-search-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [07:11:38] The schema change on it was finished yesterday 6 am from what I'm seeing [07:16:10] Amir1: see above, it is storage related [07:16:51] I know I just wanted to make sure I didn't mess up anything on top [07:21:08] Thankfully it seems I didn't 😁 [07:42:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T298555)', diff saved to https://phabricator.wikimedia.org/P28254 and previous config saved to /var/cache/conftool/dbconfig/20220522-074255-ladsgroup.json [07:42:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1148.eqiad.wmnet with reason: Maintenance [07:42:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1148.eqiad.wmnet with reason: Maintenance [07:43:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:02] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [07:43:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1148 (T298555)', diff saved to https://phabricator.wikimedia.org/P28255 and previous config saved to /var/cache/conftool/dbconfig/20220522-074303-ladsgroup.json [07:43:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:29] RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:03:41] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:11:45] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:21:27] (PS3) KartikMistry: Enable ContentTranslation as default for cs, el, he, ko and tr WPs [mediawiki-config] - https://gerrit.wikimedia.org/r/793444 (https://phabricator.wikimedia.org/T298239) [08:40:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1167.eqiad.wmnet with reason: Maintenance [08:40:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1167.eqiad.wmnet with reason: Maintenance [08:40:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [08:40:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [08:40:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1167 (T303603)', diff saved to https://phabricator.wikimedia.org/P28256 and previous config saved to /var/cache/conftool/dbconfig/20220522-084036-ladsgroup.json [08:40:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:47] T303603: Add actor and comment columns to cu_changes
- https://phabricator.wikimedia.org/T303603 [08:50:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T298560)', diff saved to https://phabricator.wikimedia.org/P28257 and previous config saved to /var/cache/conftool/dbconfig/20220522-085056-ladsgroup.json [08:51:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:02] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [09:06:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P28258 and previous config saved to /var/cache/conftool/dbconfig/20220522-090601-ladsgroup.json [09:06:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T303603)', diff saved to https://phabricator.wikimedia.org/P28259 and previous config saved to /var/cache/conftool/dbconfig/20220522-090811-ladsgroup.json [09:08:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:19] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [09:21:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P28260 and previous config saved to /var/cache/conftool/dbconfig/20220522-092106-ladsgroup.json [09:21:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P28261 and previous config saved to /var/cache/conftool/dbconfig/20220522-092317-ladsgroup.json [09:23:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:29] (PS1) Majavah: Separate
metricsinfra nodes from prometheus_nodes on cloud [puppet] - https://gerrit.wikimedia.org/r/795143 [09:34:08] (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35466/console" [puppet] - https://gerrit.wikimedia.org/r/795143 (owner: Majavah) [09:35:06] (PS2) Majavah: Separate metricsinfra nodes from prometheus_nodes on cloud [puppet] - https://gerrit.wikimedia.org/r/795143 [09:36:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T298560)', diff saved to https://phabricator.wikimedia.org/P28262 and previous config saved to /var/cache/conftool/dbconfig/20220522-093611-ladsgroup.json [09:36:13] (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35467/console" [puppet] - https://gerrit.wikimedia.org/r/795143 (owner: Majavah) [09:36:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on db1170.eqiad.wmnet with reason: Maintenance [09:36:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db1170.eqiad.wmnet with reason: Maintenance [09:36:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:17] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [09:36:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T298560)', diff saved to https://phabricator.wikimedia.org/P28263 and previous config saved to /var/cache/conftool/dbconfig/20220522-093619-ladsgroup.json [09:36:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:03] PROBLEM - Gerrit Health Check SSL Expiry on
gerrit.wikimedia.org is CRITICAL: CRITICAL - Certificate gerrit.wikimedia.org expires in 6 day(s) (Sat 28 May 2022 08:33:22 PM GMT +0000). https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [09:38:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P28264 and previous config saved to /var/cache/conftool/dbconfig/20220522-093822-ladsgroup.json [09:38:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:23] RECOVERY - Gerrit Health Check SSL Expiry on gerrit.wikimedia.org is OK: OK - Certificate gerrit.wikimedia.org will expire on Wed 27 Jul 2022 08:27:52 PM GMT +0000. https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [09:51:00] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:53:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T303603)', diff saved to https://phabricator.wikimedia.org/P28265 and previous config saved to /var/cache/conftool/dbconfig/20220522-095327-ladsgroup.json [09:53:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:34] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [09:54:49] PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:04:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T298555)', diff saved to https://phabricator.wikimedia.org/P28266 and previous config saved to /var/cache/conftool/dbconfig/20220522-100429-ladsgroup.json [10:04:30] !log ladsgroup@cumin1001 START - Cookbook 
sre.hosts.downtime for 10:00:00 on db1149.eqiad.wmnet with reason: Maintenance [10:04:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1149.eqiad.wmnet with reason: Maintenance [10:04:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:34] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [10:04:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1149 (T298555)', diff saved to https://phabricator.wikimedia.org/P28267 and previous config saved to /var/cache/conftool/dbconfig/20220522-100436-ladsgroup.json [10:04:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:56] (PS4) Majavah: metricsinfra: Use prometheus-configurator [puppet] - https://gerrit.wikimedia.org/r/763664 (https://phabricator.wikimedia.org/T286299) [10:10:58] (PS1) Majavah: P:metricsinfra::alertmanager: proxy access for trusted projects [puppet] - https://gerrit.wikimedia.org/r/795192 (https://phabricator.wikimedia.org/T304716) [10:14:20] (CR) jerkins-bot: [V: -1] P:metricsinfra::alertmanager: proxy access for trusted projects [puppet] - https://gerrit.wikimedia.org/r/795192 (https://phabricator.wikimedia.org/T304716) (owner: Majavah) [10:14:37] (CR) jerkins-bot: [V: -1] metricsinfra: Use prometheus-configurator [puppet] - https://gerrit.wikimedia.org/r/763664 (https://phabricator.wikimedia.org/T286299) (owner: Majavah) [10:19:09] (PS2) Majavah: P:metricsinfra::alertmanager: proxy access for trusted projects [puppet] - https://gerrit.wikimedia.org/r/795192 (https://phabricator.wikimedia.org/T304716) [10:19:10] (PS5) Majavah: metricsinfra: Use prometheus-configurator [puppet] -
https://gerrit.wikimedia.org/r/763664 (https://phabricator.wikimedia.org/T286299) [10:26:11] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:26:51] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:28:49] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:30:39] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48108 bytes in 0.156 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:30:57] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 25 Jun 2022 07:55:09 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:31:19] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.291 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:41:33] PROBLEM - Gerrit Health Check SSL Expiry on gerrit.wikimedia.org is CRITICAL: CRITICAL - Certificate gerrit.wikimedia.org expires in 6 day(s) (Sat 28 May 2022 08:33:22 PM GMT +0000). https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [10:43:51] RECOVERY - Gerrit Health Check SSL Expiry on gerrit.wikimedia.org is OK: OK - Certificate gerrit.wikimedia.org will expire on Wed 27 Jul 2022 08:27:52 PM GMT +0000.
https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [10:56:01] RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:19:13] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:50:08] (PS1) Majavah: openstack::designate: enable TLS encryption for rabbitmq [puppet] - https://gerrit.wikimedia.org/r/795356 (https://phabricator.wikimedia.org/T297268) [11:50:10] (PS1) Majavah: openstack::neutron: enable TLS encryption for rabbitmq [puppet] - https://gerrit.wikimedia.org/r/795357 (https://phabricator.wikimedia.org/T297268) [11:50:12] (PS1) Majavah: openstack::nova: enable TLS encryption for rabbitmq [puppet] - https://gerrit.wikimedia.org/r/795358 (https://phabricator.wikimedia.org/T297268) [11:52:46] (PS1) Majavah: openstack::trove: enable rabbitmq tls for api [puppet] - https://gerrit.wikimedia.org/r/795361 (https://phabricator.wikimedia.org/T297268) [11:55:07] (PS1) Majavah: cloudweb2002-dev is not behind LVS [puppet] - https://gerrit.wikimedia.org/r/795365 [11:56:23] (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35468/console" [puppet] - https://gerrit.wikimedia.org/r/795365 (owner: Majavah) [11:57:36] (PS1) Majavah: Revert "Horizon: include openstack bpos on cloudweb hosts" [puppet] - https://gerrit.wikimedia.org/r/795249 [11:57:45] PROBLEM - SSH on wtp1046.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:58:46] (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35469/console" [puppet] - https://gerrit.wikimedia.org/r/795249 (owner:
Majavah) [12:16:51] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:20:29] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:24:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T298555)', diff saved to https://phabricator.wikimedia.org/P28269 and previous config saved to /var/cache/conftool/dbconfig/20220522-122402-ladsgroup.json [12:24:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1144.eqiad.wmnet with reason: Maintenance [12:24:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1144.eqiad.wmnet with reason: Maintenance [12:24:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:09] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [12:24:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3314 (T298555)', diff saved to https://phabricator.wikimedia.org/P28270 and previous config saved to /var/cache/conftool/dbconfig/20220522-122410-ladsgroup.json [12:24:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:58] (PS1) Majavah: apt::repository: use signed-by instead of apt-key [puppet] - https://gerrit.wikimedia.org/r/795380 [12:26:15] (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35470/console" [puppet] - https://gerrit.wikimedia.org/r/795380 (owner: Majavah) [12:36:51] PROBLEM - Disk space on
an-master1002 is CRITICAL: DISK CRITICAL - free space: /srv 3679 MB (2% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-master1002&var-datasource=eqiad+prometheus/ops [12:58:09] RECOVERY - Disk space on an-master1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-master1002&var-datasource=eqiad+prometheus/ops [12:58:55] RECOVERY - SSH on wtp1046.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:12:49] * Krinkle testing on mwdebug1002 [13:14:55] (CR) Krinkle: [C: +2] MWConfigCacheGenerator: Move siteFromDB() deeper down the stack [mediawiki-config] - https://gerrit.wikimedia.org/r/749763 (https://phabricator.wikimedia.org/T169821) (owner: Krinkle) [13:15:39] (Merged) jenkins-bot: MWConfigCacheGenerator: Move siteFromDB() deeper down the stack [mediawiki-config] - https://gerrit.wikimedia.org/r/749763 (https://phabricator.wikimedia.org/T169821) (owner: Krinkle) [13:17:16] !log krinkle@deploy1002 scap failed: average error rate on 7/8 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org for details) [13:17:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:50] !log krinkle@deploy1002 Scap failed!: 7/8 canaries failed their endpoint checks(https://en.wikipedia.org). WARNING: canaries have not been rolled back.
[13:18:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:19:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:20:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:20:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:21:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:49] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:25:37] !log krinkle@deploy1002 Synchronized wmf-config/CommonSettings.php: I97878f8e6 (duration: 00m 50s) [13:25:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:58] !log krinkle@deploy1002 Synchronized multiversion/: I3759179dba75a9419 (duration: 00m 53s) [13:29:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:29] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [13:36:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:36:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:40:44] !log mwdebug-deploy@deploy1002 helmfile 
[codfw] START helmfile.d/services/mwdebug: apply [13:40:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:44:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:27] (PS11) Krinkle: Update CommonSettings to use array return from IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793871 (https://phabricator.wikimedia.org/T308932) (owner: Ladsgroup) [13:51:00] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:55:45] (CR) Krinkle: [C: +2] Make IS.php return an array [mediawiki-config] - https://gerrit.wikimedia.org/r/793869 (https://phabricator.wikimedia.org/T308932) (owner: Ladsgroup) [13:56:33] (Merged) jenkins-bot: Make IS.php return an array [mediawiki-config] - https://gerrit.wikimedia.org/r/793869 (https://phabricator.wikimedia.org/T308932) (owner: Ladsgroup) [13:59:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:59:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:55] !log krinkle@deploy1002 Synchronized wmf-config/InitialiseSettings.php: I31b1bfb1808b9523 (duration: 00m 52s) [14:02:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:03:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:03:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:02] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:04] (CR) Krinkle: [C: +2] Update CommonSettings to use array return from IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793871 (https://phabricator.wikimedia.org/T308932) (owner: Ladsgroup) [14:04:06] (CR) Andrew Bogott: [C: +2] Revert "Horizon: include openstack bpos on cloudweb hosts" [puppet] - https://gerrit.wikimedia.org/r/795249 (owner: Majavah) [14:04:15] (CR) jerkins-bot: [V: -1] Update CommonSettings to use array return from IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793871 (https://phabricator.wikimedia.org/T308932) (owner: Ladsgroup) [14:05:22] (PS12) Krinkle: Update CommonSettings to use array return from IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793871 (https://phabricator.wikimedia.org/T308932) (owner: Ladsgroup) [14:05:29] (CR) Krinkle: [C: +2] Update CommonSettings to use array return from IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793871 (https://phabricator.wikimedia.org/T308932) (owner: Ladsgroup) [14:06:52] (Merged) jenkins-bot: Update CommonSettings to use array return from IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/793871 (https://phabricator.wikimedia.org/T308932) (owner: Ladsgroup) [14:07:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:07:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:52] !log krinkle@deploy1002 Synchronized multiversion/: Ia0a6d4794faaafcb (1/2) (duration: 00m 50s) [14:11:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:13:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:14:04] !log 
mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:14:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:32] !log krinkle@deploy1002 scap failed: average error rate on 3/8 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org for details) [14:14:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:15:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:53] (CR) Andrew Bogott: [C: +2] "Correct, wikitech-static is in ORD now" [dns] - https://gerrit.wikimedia.org/r/793729 (https://phabricator.wikimedia.org/T155761) (owner: Volans) [14:18:03] we will see an increase in 500x [14:18:06] that's fine [14:18:10] it'll recover [14:18:23] !log krinkle@deploy1002 Synchronized wmf-config/: Ia0a6d4794faaafcb (2/2) (duration: 00m 42s) [14:18:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:23] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:23:02] !log krinkle@deploy1002 Synchronized docroot/noc/: Ia0a6d4794faaafc (duration: 00m 50s) [14:23:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:23] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [14:27:55] !log krinkle@deploy1002 Synchronized src/: Ia0a6d4794faaafc (duration: 00m 50s) [14:27:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:17] (Restored) Winston Sung: 
Rearrange zh-related fallbacks and zh/zh-* translations, aliases in mediawiki/core [core] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/788416 (https://phabricator.wikimedia.org/T286291) (owner: Winston Sung) [14:34:24] (Restored) Winston Sung: Rearrange zh-related fallbacks and zh/zh-* translations, aliases in mediawiki/core [core] (wmf/1.39.0-wmf.9) - https://gerrit.wikimedia.org/r/788606 (https://phabricator.wikimedia.org/T286291) (owner: Winston Sung) [14:34:35] (PS2) Winston Sung: Rearrange zh-related fallbacks and zh/zh-* translations, aliases in mediawiki/core [core] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/788416 [14:34:43] (Abandoned) Winston Sung: Rearrange zh-related fallbacks and zh/zh-* translations, aliases in mediawiki/core [core] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/788416 (owner: Winston Sung) [14:34:53] (PS4) Winston Sung: Rearrange zh-related fallbacks and zh/zh-* translations, aliases in mediawiki/core [core] (wmf/1.39.0-wmf.9) - https://gerrit.wikimedia.org/r/788606 [14:34:58] (Abandoned) Winston Sung: Rearrange zh-related fallbacks and zh/zh-* translations, aliases in mediawiki/core [core] (wmf/1.39.0-wmf.9) - https://gerrit.wikimedia.org/r/788606 (owner: Winston Sung) [14:37:19] (Restored) Winston Sung: Temporarily disable yue language fallback tests [extensions/Wikibase] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/788417 (https://phabricator.wikimedia.org/T299377) (owner: Winston Sung) [14:37:26] (PS2) Winston Sung: Temporarily disable yue language fallback tests [extensions/Wikibase] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/788417 [14:37:30] (Abandoned) Winston Sung: Temporarily disable yue language fallback tests [extensions/Wikibase] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/788417 (owner: Winston Sung) [14:37:33] (Restored) Winston Sung: Temporarily disable yue language fallback tests [extensions/Wikibase] 
(wmf/1.39.0-wmf.9) - https://gerrit.wikimedia.org/r/788608 (https://phabricator.wikimedia.org/T299377) (owner: Winston Sung) [14:37:40] (PS2) Winston Sung: Temporarily disable yue language fallback tests [extensions/Wikibase] (wmf/1.39.0-wmf.9) - https://gerrit.wikimedia.org/r/788608 [14:37:52] (Restored) Winston Sung: Revert "Temporarily disable yue language fallback tests" [extensions/Wikibase] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/788418 (https://phabricator.wikimedia.org/T299377) (owner: Winston Sung) [14:37:59] (PS3) Winston Sung: Revert "Temporarily disable yue language fallback tests" [extensions/Wikibase] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/788418 [14:38:04] (Abandoned) Winston Sung: Revert "Temporarily disable yue language fallback tests" [extensions/Wikibase] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/788418 (owner: Winston Sung) [14:38:13] (Restored) Winston Sung: Revert "Temporarily disable yue language fallback tests" [extensions/Wikibase] (wmf/1.39.0-wmf.9) - https://gerrit.wikimedia.org/r/788610 (https://phabricator.wikimedia.org/T299377) (owner: Winston Sung) [14:38:26] (PS3) Winston Sung: Revert "Temporarily disable yue language fallback tests" [extensions/Wikibase] (wmf/1.39.0-wmf.9) - https://gerrit.wikimedia.org/r/788610 [14:38:32] (Abandoned) Winston Sung: Revert "Temporarily disable yue language fallback tests" [extensions/Wikibase] (wmf/1.39.0-wmf.9) - https://gerrit.wikimedia.org/r/788610 (owner: Winston Sung) [14:46:49] (CR) Winston Sung: "The Depends-on has been abandoned, please abandon this change." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/775423 (https://phabricator.wikimedia.org/T273578) (owner: Func) [14:46:57] (CR) Andrew Bogott: [C: +2] openstack,admin_script: ran black and isort [puppet] - https://gerrit.wikimedia.org/r/789117 (https://phabricator.wikimedia.org/T305909) (owner: David Caro) [14:48:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T298555)', diff saved to https://phabricator.wikimedia.org/P28272 and previous config saved to /var/cache/conftool/dbconfig/20220522-144847-ladsgroup.json [14:48:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1138.eqiad.wmnet with reason: Maintenance [14:48:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1138.eqiad.wmnet with reason: Maintenance [14:48:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:54] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [14:48:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1138 (T298555)', diff saved to https://phabricator.wikimedia.org/P28273 and previous config saved to /var/cache/conftool/dbconfig/20220522-144855-ladsgroup.json [14:48:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:24] (CR) Andrew Bogott: [C: +1] cloudvirt-libvirt-stats: Avoid printing to stdout [puppet] - https://gerrit.wikimedia.org/r/790388 (owner: David Caro) [14:54:52] (PS1) Stang: zhwiki: Enable RCPatrol [mediawiki-config] - https://gerrit.wikimedia.org/r/795526 (https://phabricator.wikimedia.org/T308976) [15:35:47] (Abandoned) Func: Use variants fallback to define logos for zhwiki [mediawiki-config] - 
https://gerrit.wikimedia.org/r/775423 (https://phabricator.wikimedia.org/T273578) (owner: Func) [16:11:49] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:48:23] (Abandoned) Winston Sung: Temporarily disable yue language fallback tests [extensions/Wikibase] (wmf/1.39.0-wmf.9) - https://gerrit.wikimedia.org/r/788608 (owner: Winston Sung) [17:01:23] (CR) Krinkle: [C: +1] "Through machine translation, I understand the consensus." [mediawiki-config] - https://gerrit.wikimedia.org/r/795526 (https://phabricator.wikimedia.org/T308976) (owner: Stang) [17:14:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T298555)', diff saved to https://phabricator.wikimedia.org/P28274 and previous config saved to /var/cache/conftool/dbconfig/20220522-171444-ladsgroup.json [17:14:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:49] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [17:51:00] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:05:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T298560)', diff saved to https://phabricator.wikimedia.org/P28275 and previous config saved to /var/cache/conftool/dbconfig/20220522-180506-ladsgroup.json [18:05:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:13] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [18:20:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P28276 and previous config saved to /var/cache/conftool/dbconfig/20220522-182011-ladsgroup.json [18:20:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P28277 and previous config saved to /var/cache/conftool/dbconfig/20220522-183516-ladsgroup.json [18:35:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:32] seems like wikibugs died an hour ago and is still not back [18:50:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T298560)', diff saved to https://phabricator.wikimedia.org/P28278 and previous config saved to /var/cache/conftool/dbconfig/20220522-185021-ladsgroup.json [18:50:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:28] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [18:59:37] PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:46:52] PROBLEM - exim queue #page on mx1001 is CRITICAL: CRITICAL: 4040 mails in exim queue. https://wikitech.wikimedia.org/wiki/Exim https://grafana.wikimedia.org/d/000000451/mail [19:47:55] <_joe_> o/ [19:48:33] * jbond here, looking [19:48:56] here. so there was recent change to switch mx alerting. looking if that was merged [19:49:13] ah, no. not merged yet [19:49:39] 👋 [19:49:41] 2399 118MB 43h 0m tools.wmflabs.org [19:54:14] is it libup bot again? 
[19:55:27] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:56:37] zabe: yea [19:57:37] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.297 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:58:49] ACKNOWLEDGEMENT - exim queue #page on mx1001 is CRITICAL: CRITICAL: 4040 mails in exim queue. daniel_zahn https://phabricator.wikimedia.org/T306295 https://wikitech.wikimedia.org/wiki/Exim https://grafana.wikimedia.org/d/000000451/mail [20:00:53] RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:02:34] * Krinkle testing on mwdebug1002 [20:18:10] RECOVERY - exim queue #page on mx1001 is OK: OK: Less than 2000 mails in exim queue. https://wikitech.wikimedia.org/wiki/Exim https://grafana.wikimedia.org/d/000000451/mail [20:29:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:29:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:34] !log krinkle@deploy1002 Synchronized src/XhguiSaverPdo.php: I3882be35572 (duration: 00m 50s) [20:31:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:31:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:31:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:42] !log krinkle@deploy1002 Synchronized wmf-config/profiler.php: I3882be35572 (duration: 00m 51s) [20:32:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:32:51] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:02] !log krinkle@deploy1002 Synchronized lib/: I3882be35572 (duration: 00m 50s) [20:34:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:34] !log krinkle@deploy1002 Synchronized src/Profiler.php: I14c5a9aa39 (duration: 00m 49s) [20:41:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:36] !log krinkle@deploy1002 Synchronized wmf-config/: I14c5a9aa39 (duration: 00m 50s) [20:42:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:42:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:43:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:43:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:46:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:24] * Krinkle done testing [21:51:00] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:20:59] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:30:19] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The 
following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:46:09] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 46.87 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [23:46:47] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 56.37 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [23:48:29] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [23:49:07] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: (C)60 le (W)70 le 81.26 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [23:52:13] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:55:53] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state