[00:10:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1174.eqiad.wmnet with reason: Maintenance [00:10:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1174.eqiad.wmnet with reason: Maintenance [00:10:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:10:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T298555)', diff saved to https://phabricator.wikimedia.org/P28206 and previous config saved to /var/cache/conftool/dbconfig/20220521-001014-ladsgroup.json [00:10:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:10:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:10:21] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [00:14:19] PROBLEM - SSH on wtp1046.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:23:49] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:33:55] PROBLEM - Disk space on an-master1002 is CRITICAL: DISK CRITICAL - free space: /srv 7009 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-master1002&var-datasource=eqiad+prometheus/ops [00:33:57] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:06:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T298555)', diff saved to https://phabricator.wikimedia.org/P28207 and previous config saved to /var/cache/conftool/dbconfig/20220521-010626-ladsgroup.json [01:06:29] !log ladsgroup@cumin1001 
START - Cookbook sre.hosts.downtime for 10:00:00 on db1158.eqiad.wmnet with reason: Maintenance [01:06:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1158.eqiad.wmnet with reason: Maintenance [01:06:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 20:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [01:06:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:06:33] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [01:06:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [01:06:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:06:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:06:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T298555)', diff saved to https://phabricator.wikimedia.org/P28208 and previous config saved to /var/cache/conftool/dbconfig/20220521-010640-ladsgroup.json [01:06:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:06:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:06:49] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 33.07 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [01:06:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:08:25] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 42.83 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts 
https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [01:10:31] RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:10:41] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [01:11:21] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [01:16:25] RECOVERY - Disk space on an-master1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-master1002&var-datasource=eqiad+prometheus/ops [01:38:43] RECOVERY - Disk space on dumpsdata1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=dumpsdata1003&var-datasource=eqiad+prometheus/ops [01:40:45] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:50:45] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:57:45] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:01:02] anyone else getting database errors? https://usercontent.irccloud-cdn.com/file/sTIsMliq/image.png [02:01:42] doesn't happen when loading my own log, though [02:04:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T298555)', diff saved to https://phabricator.wikimedia.org/P28209 and previous config saved to /var/cache/conftool/dbconfig/20220521-020449-ladsgroup.json [02:04:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1170.eqiad.wmnet with reason: Maintenance [02:04:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1170.eqiad.wmnet with reason: Maintenance [02:04:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:04:56] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [02:04:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T298555)', diff saved to https://phabricator.wikimedia.org/P28210 and previous config saved to /var/cache/conftool/dbconfig/20220521-020457-ladsgroup.json [02:04:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:05:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:05:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:20:11] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:11:29] PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:25:09] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: 
eventlogging_to_druid_navigationtiming_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:39:45] PROBLEM - Gerrit Health Check SSL Expiry on gerrit.wikimedia.org is CRITICAL: CRITICAL - Certificate gerrit.wikimedia.org expires in 7 day(s) (Sat 28 May 2022 08:33:22 PM GMT +0000). https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [03:42:01] RECOVERY - Gerrit Health Check SSL Expiry on gerrit.wikimedia.org is OK: OK - Certificate gerrit.wikimedia.org will expire on Wed 27 Jul 2022 08:27:52 PM GMT +0000. https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [04:00:09] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:01:09] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:12:35] RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:17:15] Tamzin: Is it a user that you expect might have a lot of log entries [04:17:40] I would recommend filing a task tagged with DBA if you want someone to look at it [04:18:56] #19 all-time per [04:19:15] RECOVERY - SSH on wtp1046.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:19:19] his log does load fine now, on the third attempt [04:26:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T298555)', diff saved to https://phabricator.wikimedia.org/P28211 and previous config saved to /var/cache/conftool/dbconfig/20220521-042650-ladsgroup.json [04:26:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1127.eqiad.wmnet with reason: Maintenance [04:26:55] !log ladsgroup@cumin1001 END 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1127.eqiad.wmnet with reason: Maintenance [04:26:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:26:58] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [04:27:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:27:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T298555)', diff saved to https://phabricator.wikimedia.org/P28212 and previous config saved to /var/cache/conftool/dbconfig/20220521-042700-ladsgroup.json [04:27:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:27:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:31:18] (ProbeDown) firing: Service shellbox:4008 has failed probes (http_shellbox_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:36:18] (ProbeDown) resolved: Service shellbox:4008 has failed probes (http_shellbox_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:23:51] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:51:00] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:30:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 
on dbstore1007.eqiad.wmnet with reason: Maintenance [06:30:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [06:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:30:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:31:18] (ProbeDown) firing: Service shellbox:4008 has failed probes (http_shellbox_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:36:19] (ProbeDown) resolved: Service shellbox:4008 has failed probes (http_shellbox_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:41:31] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220521T0700) [07:18:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T298555)', diff saved to https://phabricator.wikimedia.org/P28213 and previous config saved to /var/cache/conftool/dbconfig/20220521-071828-ladsgroup.json [07:18:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1136.eqiad.wmnet with reason: Maintenance [07:18:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1136.eqiad.wmnet with reason: Maintenance [07:18:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:35] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [07:18:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1136 (T298555)', diff saved to https://phabricator.wikimedia.org/P28214 and previous config saved to /var/cache/conftool/dbconfig/20220521-071836-ladsgroup.json [07:18:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:41] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:17:43] PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:24:17] PROBLEM - SSH on wtp1046.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:31:59] RECOVERY - Check for large files in client bucket on an-launcher1002 is OK: OK: client bucket file ok 
https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [08:35:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T298555)', diff saved to https://phabricator.wikimedia.org/P28215 and previous config saved to /var/cache/conftool/dbconfig/20220521-083533-ladsgroup.json [08:35:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:39] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [08:48:42] SRE, Wikimedia-Site-requests, Performance-Team (Radar): Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319 (Alexey_Skripnik) Oh wow, this was open for more than a year ago. Why it hasn't been done yet? 1. There is a consensus among editors that ot sho... [09:00:11] SRE, Wikimedia-Site-requests, Performance-Team (Radar): Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319 (Reedy) We don't use LocalSettings.php on Wikimedia wikis ;) [09:11:56] SRE, Gerrit: Icinga Check SSL might have a time based race condition - https://phabricator.wikimedia.org/T308908 (hashar) [09:25:21] RECOVERY - SSH on wtp1046.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:31:19] (ProbeDown) firing: Service shellbox:4008 has failed probes (http_shellbox_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:35:15] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:36:19] (ProbeDown) resolved: Service shellbox:4008 has failed probes (http_shellbox_ip4) 
- https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:48:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1102.eqiad.wmnet with reason: Maintenance [09:48:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1102.eqiad.wmnet with reason: Maintenance [09:48:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1150.eqiad.wmnet with reason: Maintenance [09:50:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1150.eqiad.wmnet with reason: Maintenance [09:50:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:00] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:52:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [09:52:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [09:52:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:16] SRE, Gerrit: Icinga Check SSL might have a time based race condition - 
https://phabricator.wikimedia.org/T308908 (RhinosF1) {T293826} maybe? [10:05:42] hashar: try restarting apache if it's flapping [10:07:13] PROBLEM - Gerrit Health Check SSL Expiry on gerrit.wikimedia.org is CRITICAL: CRITICAL - Certificate gerrit.wikimedia.org expires in 7 day(s) (Sat 28 May 2022 08:33:22 PM GMT +0000). https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [10:09:25] RECOVERY - Gerrit Health Check SSL Expiry on gerrit.wikimedia.org is OK: OK - Certificate gerrit.wikimedia.org will expire on Wed 27 Jul 2022 08:27:52 PM GMT +0000. https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [10:18:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1172.eqiad.wmnet with reason: Maintenance [10:18:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1172.eqiad.wmnet with reason: Maintenance [10:18:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:25] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:42:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1109.eqiad.wmnet with reason: Maintenance [10:42:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1109.eqiad.wmnet with reason: Maintenance [10:42:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1109 (T303603)', diff saved to https://phabricator.wikimedia.org/P28216 and previous config saved to /var/cache/conftool/dbconfig/20220521-104247-ladsgroup.json [10:42:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:54] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:55] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [10:46:17] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:11:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1109 (T303603)', diff saved to https://phabricator.wikimedia.org/P28217 and previous config saved to /var/cache/conftool/dbconfig/20220521-111138-ladsgroup.json [11:11:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [11:11:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [11:11:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:45] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [11:11:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3318 (T303603)', diff saved to https://phabricator.wikimedia.org/P28218 and previous config saved to /var/cache/conftool/dbconfig/20220521-111146-ladsgroup.json [11:11:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:09] (PS1) Gergő Tisza: Remove 'required' from callbackIsPrefix [extensions/OAuth] (wmf/1.39.0-wmf.12) - https://gerrit.wikimedia.org/r/793795 (https://phabricator.wikimedia.org/T308880) [11:31:18] (ProbeDown) firing: Service shellbox:4008 has failed probes (http_shellbox_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:36:18] (ProbeDown) resolved: Service shellbox:4008 has failed probes (http_shellbox_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:43:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318 (T303603)', diff saved to https://phabricator.wikimedia.org/P28219 and previous config saved to /var/cache/conftool/dbconfig/20220521-114318-ladsgroup.json [11:43:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2079.codfw.wmnet with reason: Maintenance [11:43:24] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [11:43:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2079.codfw.wmnet with reason: Maintenance [11:43:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 12 hosts with reason: Maintenance [11:43:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 12 hosts with reason: Maintenance [11:43:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:31] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook 
[11:51:07] PROBLEM - SSH on analytics1061.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:59:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1147.eqiad.wmnet with reason: Maintenance [11:59:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1147.eqiad.wmnet with reason: Maintenance [11:59:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1147 (T298555)', diff saved to https://phabricator.wikimedia.org/P28220 and previous config saved to /var/cache/conftool/dbconfig/20220521-115919-ladsgroup.json [11:59:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:23] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [12:09:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [12:09:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [12:09:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3318 (T303603)', diff saved to https://phabricator.wikimedia.org/P28221 and previous config saved to /var/cache/conftool/dbconfig/20220521-120926-ladsgroup.json [12:09:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:33] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [12:20:16] !log 
ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1164.eqiad.wmnet with reason: Maintenance [12:20:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1164.eqiad.wmnet with reason: Maintenance [12:20:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1164 (T306560)', diff saved to https://phabricator.wikimedia.org/P28222 and previous config saved to /var/cache/conftool/dbconfig/20220521-122023-ladsgroup.json [12:20:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:30] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [12:22:39] RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:22:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164 (T306560)', diff saved to https://phabricator.wikimedia.org/P28223 and previous config saved to /var/cache/conftool/dbconfig/20220521-122241-ladsgroup.json [12:22:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318 (T303603)', diff saved to https://phabricator.wikimedia.org/P28224 and previous config saved to /var/cache/conftool/dbconfig/20220521-124124-ladsgroup.json [12:41:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [12:41:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [12:41:30] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:31] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [12:41:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:15] RECOVERY - SSH on analytics1061.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:08:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1116.eqiad.wmnet with reason: Maintenance [13:08:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1116.eqiad.wmnet with reason: Maintenance [13:08:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1111.eqiad.wmnet with reason: Maintenance [13:34:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1111.eqiad.wmnet with reason: Maintenance [13:34:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1111 (T303603)', diff saved to https://phabricator.wikimedia.org/P28225 and previous config saved to /var/cache/conftool/dbconfig/20220521-133431-ladsgroup.json [13:34:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:38] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [13:40:11] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:44:45] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=UPDATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:51:00] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:53:33] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [14:05:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111 (T303603)', diff saved to https://phabricator.wikimedia.org/P28226 and previous config saved to /var/cache/conftool/dbconfig/20220521-140512-ladsgroup.json [14:05:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1114.eqiad.wmnet with reason: Maintenance [14:05:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1114.eqiad.wmnet with reason: Maintenance [14:05:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:19] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [14:05:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1114 (T303603)', diff saved to https://phabricator.wikimedia.org/P28227 and previous config saved to /var/cache/conftool/dbconfig/20220521-140520-ladsgroup.json [14:05:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:30] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T298555)', diff saved to https://phabricator.wikimedia.org/P28228 and previous config saved to /var/cache/conftool/dbconfig/20220521-141918-ladsgroup.json [14:19:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1146.eqiad.wmnet with reason: Maintenance [14:19:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1146.eqiad.wmnet with reason: Maintenance [14:19:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:24] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [14:19:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3314 (T298555)', diff saved to https://phabricator.wikimedia.org/P28229 and previous config saved to /var/cache/conftool/dbconfig/20220521-141926-ladsgroup.json [14:19:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on db1105.eqiad.wmnet with reason: Maintenance [14:28:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db1105.eqiad.wmnet with reason: Maintenance [14:28:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T298560)', diff saved to https://phabricator.wikimedia.org/P28230 and previous config saved to /var/cache/conftool/dbconfig/20220521-142836-ladsgroup.json [14:28:38] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:43] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [14:35:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114 (T303603)', diff saved to https://phabricator.wikimedia.org/P28231 and previous config saved to /var/cache/conftool/dbconfig/20220521-143459-ladsgroup.json [14:35:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1126.eqiad.wmnet with reason: Maintenance [14:35:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1126.eqiad.wmnet with reason: Maintenance [14:35:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:06] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [14:35:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1126 (T303603)', diff saved to https://phabricator.wikimedia.org/P28232 and previous config saved to /var/cache/conftool/dbconfig/20220521-143507-ladsgroup.json [14:35:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126 (T303603)', diff saved to https://phabricator.wikimedia.org/P28233 and previous config saved to /var/cache/conftool/dbconfig/20220521-150549-ladsgroup.json [15:05:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1167.eqiad.wmnet with reason: Maintenance [15:05:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:53] !log ladsgroup@cumin1001 END 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1167.eqiad.wmnet with reason: Maintenance [15:05:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [15:05:54] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [15:05:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [15:06:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1167 (T303603)', diff saved to https://phabricator.wikimedia.org/P28234 and previous config saved to /var/cache/conftool/dbconfig/20220521-150602-ladsgroup.json [15:06:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:39] thcipriani: jnuche: hello! I have a CentralAuth revert https://gerrit.wikimedia.org/r/793797 that I'd like to backport ASAP to prevent local accounts getting out of sync from the central databases. 
requesting your approval per https://wikitech.wikimedia.org/wiki/Deployments/Emergencies [15:27:09] PROBLEM - SSH on labweb1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:27:45] (03PS1) 10Majavah: Revert "Populate rq_wiki with the wiki where the rename was requested" [extensions/CentralAuth] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/793798 (https://phabricator.wikimedia.org/T308895) [15:43:52] (03PS1) 10Stang: itwiki: Add "editautopatrolprotected" protection level [mediawiki-config] - 10https://gerrit.wikimedia.org/r/794590 (https://phabricator.wikimedia.org/T308917) [15:44:29] PROBLEM - Disk space on an-master1002 is CRITICAL: DISK CRITICAL - free space: /srv 1194 MB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-master1002&var-datasource=eqiad+prometheus/ops [16:06:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T303603)', diff saved to https://phabricator.wikimedia.org/P28235 and previous config saved to /var/cache/conftool/dbconfig/20220521-160616-ladsgroup.json [16:06:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1177.eqiad.wmnet with reason: Maintenance [16:06:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1177.eqiad.wmnet with reason: Maintenance [16:06:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:24] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [16:06:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1177 (T303603)', diff saved to https://phabricator.wikimedia.org/P28236 and previous config saved to /var/cache/conftool/dbconfig/20220521-160624-ladsgroup.json [16:06:28] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:13] RECOVERY - SSH on labweb1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:36:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T303603)', diff saved to https://phabricator.wikimedia.org/P28237 and previous config saved to /var/cache/conftool/dbconfig/20220521-163631-ladsgroup.json [16:36:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1178.eqiad.wmnet with reason: Maintenance [16:36:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1178.eqiad.wmnet with reason: Maintenance [16:36:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:36] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [16:36:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1178 (T303603)', diff saved to https://phabricator.wikimedia.org/P28238 and previous config saved to /var/cache/conftool/dbconfig/20220521-163639-ladsgroup.json [16:36:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T298555)', diff saved to https://phabricator.wikimedia.org/P28239 and previous config saved to /var/cache/conftool/dbconfig/20220521-164805-ladsgroup.json [16:48:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db2110.codfw.wmnet with reason: 
Maintenance [16:48:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:12] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [16:48:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2110.codfw.wmnet with reason: Maintenance [16:48:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 20:00:00 on 12 hosts with reason: Maintenance [16:48:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on 12 hosts with reason: Maintenance [16:48:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:45] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:01:16] (03PS2) 10Krinkle: Make IS.php return an array [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793869 (https://phabricator.wikimedia.org/T308932) (owner: 10Ladsgroup) [17:01:24] (03PS9) 10Krinkle: Make CommonSettings load the array from IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793871 (owner: 10Ladsgroup) [17:01:38] (03PS10) 10Krinkle: Update CommonSettings to use array return from IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793871 (https://phabricator.wikimedia.org/T308932) (owner: 10Ladsgroup) [17:01:45] (03PS13) 10Krinkle: Move out ORES extension configuration out of InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793873 (owner: 10Ladsgroup) [17:02:05] (03PS14) 10Krinkle: Move ORES settings from InitialiseSettings.php to ext-ORES.php 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/793873 (https://phabricator.wikimedia.org/T308932) (owner: 10Ladsgroup) [17:06:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T303603)', diff saved to https://phabricator.wikimedia.org/P28240 and previous config saved to /var/cache/conftool/dbconfig/20220521-170638-ladsgroup.json [17:06:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:45] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [17:09:21] RECOVERY - Disk space on an-master1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-master1002&var-datasource=eqiad+prometheus/ops [17:09:51] (03PS2) 10Krinkle: Remove wgPriorityHints and wgPriorityHintsRatio [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793402 (https://phabricator.wikimedia.org/T308707) (owner: 10Mainframe98) [17:10:00] (03CR) 10Krinkle: [C: 03+1] "Good to go anytime." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793402 (https://phabricator.wikimedia.org/T308707) (owner: 10Mainframe98) [17:15:34] * Krinkle testing out https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/749762 on mwdebug1002, cc Amir1 [17:16:17] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:16:22] Thanks. I'm hoping on train right now [17:17:42] Amir1: taavi is looking for an sre to approve a deploy [17:18:09] Is it labs only? 
[17:18:19] Amir1: no, emergency [17:18:25] Oh [17:18:28] see -releng and a while back up [17:18:30] Let me see [17:21:30] (03CR) 10Zabe: [C: 03+1] Revert "Populate rq_wiki with the wiki where the rename was requested" [extensions/CentralAuth] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/793798 (https://phabricator.wikimedia.org/T308895) (owner: 10Majavah) [17:25:27] PROBLEM - Apache HTTP on mwdebug1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2253 bytes in 0.020 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:27:22] (03CR) 10Krinkle: "cherry-pick on mwdebug1002, meh:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749762 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [17:27:41] RECOVERY - Apache HTTP on mwdebug1002 is OK: HTTP OK: HTTP/1.1 302 Found - 550 bytes in 0.065 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:27:57] (03PS5) 10Krinkle: wmf-config: Move loading/computing of $globals to a method for profiling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749762 (https://phabricator.wikimedia.org/T169821) [17:31:55] (03CR) 10Krinkle: [C: 03+2] wmf-config: Move loading/computing of $globals to a method for profiling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749762 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [17:34:11] (03Merged) 10jenkins-bot: wmf-config: Move loading/computing of $globals to a method for profiling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749762 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [17:34:13] (03PS4) 10Krinkle: MWConfigCacheGenerator: Move siteFromDB() deeper down the stack [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749763 (https://phabricator.wikimedia.org/T169821) [17:37:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:37:28] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:38:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [17:38:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:35] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 91 probes of 665 (alerts on 90) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:39:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [17:39:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:10] !log krinkle@deploy1002 Synchronized multiversion/: I97878f8e6fdd5cf (duration: 00m 51s) [17:43:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:49] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 54 probes of 665 (alerts on 90) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:47:18] taavi: I'm around now, if you want to do the deploy [17:47:36] sure! [17:47:43] Krinkle: still deploying or can we go ahead? 
[17:49:00] (03CR) 10Majavah: [C: 03+2] Revert "Populate rq_wiki with the wiki where the rename was requested" [extensions/CentralAuth] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/793798 (https://phabricator.wikimedia.org/T308895) (owner: 10Majavah) [17:49:29] taavi: go ahead [17:49:43] thanks [17:51:00] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:52:01] (03Merged) 10jenkins-bot: Revert "Populate rq_wiki with the wiki where the rename was requested" [extensions/CentralAuth] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/793798 (https://phabricator.wikimedia.org/T308895) (owner: 10Majavah) [17:53:49] testing on mwdebug [17:57:53] syncing [17:58:25] PROBLEM - SSH on analytics1061.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:58:50] !log taavi@deploy1002 Synchronized php-1.39.0-wmf.12/extensions/CentralAuth: Backport: [[gerrit:793798|Revert "Populate rq_wiki with the wiki where the rename was requested" (T308895)]] (duration: 00m 51s) [17:58:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:56] T308895: GlobalRename not renaming some accounts - https://phabricator.wikimedia.org/T308895 [17:59:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:59:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:00:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:00:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:42] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:01:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:58] taavi: can you confirm if this fixed the issue? [18:04:27] Amir1: new rename requests are no longer broken, I'm just fixing the ones already in the queue [18:04:36] cool [18:06:06] !log set rq_wiki = null for 26 rows in centralauth.renameuser_queue status table T308895 [18:06:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:10] T308895: GlobalRename not renaming some accounts - https://phabricator.wikimedia.org/T308895 [18:06:27] Amir1: the queue now works properly again [18:07:13] we should consider dropping support for local renames through the rename queue [18:07:39] Awesome, I'll send an email to them [18:08:51] now I just need to fix the 47 accounts whose renames were approved but actually weren't renamed [18:10:06] we don't have global renames in the action api? :( [18:11:42] ad-hoc maintenance script it is then I guess [18:12:27] taavi: let someone review it beforehand, especially testing it on the beta cluster would be really appreciated [18:12:51] Amir1: will do [18:17:27] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:39:55] 10SRE, 10Wikimedia-Site-requests, 10Performance-Team (Radar): Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319 (10Alexey_Skripnik) Ok, I didn't use that for Wikimedia wikis you use different kind of settings. But I assume it works like that: there is a variab... 
[18:42:42] zabe: Amir1: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaMaintenance/+/794675/ [18:42:52] works fine locally, didn't test on beta yet [18:46:47] looking [18:59:25] RECOVERY - SSH on analytics1061.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:01:27] getting 503s [19:01:40] everything is slow from here [19:01:43] +1 [19:01:43] logstash is down too? [19:01:46] * taavi klaxons [19:01:50] yup [19:01:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [19:01:59] i can't access logstash, VRTS [19:01:59] unusual 503s [19:02:02] https://wikitech.wikimedia.org/wiki down [19:02:15] down here [19:02:18] (ProbeDown) firing: (13) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:02:19] (ProbeDown) firing: (12) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:02:30] taavi: are you paging? [19:02:49] cos thats a #page [19:02:50] klaxon says it went [19:02:56] plus an auto page [19:03:00] I received it. 
[19:03:01] (NELHigh) firing: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [19:03:06] yep, although looks like the automated monitoring caught it too now [19:03:16] I got your page specifically, taavi. [19:03:57] PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) is CRITICAL: Test Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Proton [19:04:07] confirmed slow loading or failure to load (el.wp, logged in user) , via esams [19:04:08] PROBLEM - LVS text-https eqiad port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.eqiad.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 3313 bytes in 1.805 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:04:19] here too [19:04:20] I wonder if the "firing: Too many messages in kafka logging" alert is related to logstash not loading [19:04:23] checking [19:04:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1121.eqiad.wmnet with reason: Maintenance [19:04:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1121.eqiad.wmnet with reason: Maintenance [19:04:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 20:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [19:04:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:38] seems to be recovering [19:04:41] !log 
ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [19:04:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1121 (T298555)', diff saved to https://phabricator.wikimedia.org/P28241 and previous config saved to /var/cache/conftool/dbconfig/20220521-190446-ladsgroup.json [19:04:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:51] eswiki works for me [19:04:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:00] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [19:05:08] back up for me too [19:05:14] stuff's now loading on my side as well [19:05:20] same [19:05:27] yep [19:05:45] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:06:01] caught me in the shower but here now :) godog: do you need a hand? [19:06:09] RECOVERY - proton LVS eqiad on proton.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [19:06:22] RECOVERY - LVS text-https eqiad port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 18859 bytes in 0.097 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:06:38] I am around [19:06:40] rzl: yeah! 
thank you, see _security [19:07:03] ack [19:07:19] (ProbeDown) resolved: (13) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:07:19] (ProbeDown) resolved: (12) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:07:22] most of the pages haven't been acked [19:07:34] 2/4 [19:08:01] (NELHigh) resolved: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [19:08:52] Is total request volume supposed to be spiking on https://www.wikimediastatus.net? 
[19:09:20] ah, _security [19:10:45] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:10:55] (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [19:11:16] 10SRE, 10Traffic: All wikis down: error 400 - https://phabricator.wikimedia.org/T308940 (10AlexisJazz) [19:11:40] 10SRE, 10Traffic: All wikis down: error 400 - https://phabricator.wikimedia.org/T308940 (10RhinosF1) This should be resolved now [19:12:11] 10SRE, 10Traffic: All wikis down: error 400 - https://phabricator.wikimedia.org/T308940 (10AlexisJazz) [19:13:28] 10SRE, 10Traffic: All wikis down: error 400 - https://phabricator.wikimedia.org/T308940 (10AlexisJazz) >>! In T308940#7947028, @RhinosF1 wrote: > This should be resolved now I tried to report it sooner, but Phabricator was down! [19:13:38] 10SRE, 10Traffic, 10Wikimedia-Incident: All wikis down: error 400 - https://phabricator.wikimedia.org/T308940 (10Aklapper) [19:17:07] (03CR) 10Abijeet Patro: "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/793799 (https://phabricator.wikimedia.org/T287887) (owner: 10Abijeet Patro) [19:17:44] 10SRE, 10Traffic, 10Wikimedia-Incident: All wikis down: error 503 - https://phabricator.wikimedia.org/T308940 (10Krinkle) [19:19:24] (03PS3) 10Abijeet Patro: Add namespaces to Punjabi wikisource default search [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793799 (https://phabricator.wikimedia.org/T287887) [19:19:32] (03CR) 10jerkins-bot: [V: 04-1] Add namespaces to Punjabi wikisource default search [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793799 (https://phabricator.wikimedia.org/T287887) (owner: 10Abijeet Patro) [19:20:01] 10SRE: Klaxon redirects to http://klaxon.wikimedia.org (not https) - https://phabricator.wikimedia.org/T308941 (10Majavah) [19:20:48] 10SRE, 10Traffic, 10Wikimedia-Incident: All wikis down: error 503 - https://phabricator.wikimedia.org/T308940 (10AlexisJazz) [19:30:55] (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [19:33:08] 10SRE, 10Traffic, 10Wikimedia-Incident: All wikis down: error 503 - https://phabricator.wikimedia.org/T308940 (10Dzahn) Wikis are back up.. This incident is actively being investigated. [19:34:09] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 6.987 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [19:34:12] 10SRE, 10Traffic, 10Wikimedia-Incident: All wikis down: error 503 - https://phabricator.wikimedia.org/T308940 (10Krinkle) > Edit: sorry the 400 was due to a typo of mine. 
[Reporting_a_connectivity_issue](https://wikitech-static.wikimedia.org/wiki/Reporting_a_connectivity_issue) maybe shouldn't use www.wikime... [19:34:23] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on alert1001 is CRITICAL: 5.497 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [19:34:49] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 45.21 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [19:34:50] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 19.03 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [19:35:09] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is CRITICAL: 14.26 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [19:35:13] (03CR) 10Samwalton: "I assume you meant to add Sam Wilson 😊" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793799 (https://phabricator.wikimedia.org/T287887) (owner: 10Abijeet Patro) [19:36:22] (03PS1) 10Majavah: Add a script to fix T308895 renames [extensions/WikimediaMaintenance] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/793800 (https://phabricator.wikimedia.org/T308895) [19:36:29] (03CR) 10jerkins-bot: [V: 04-1] Add a script to fix T308895 renames [extensions/WikimediaMaintenance] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/793800 (https://phabricator.wikimedia.org/T308895) (owner: 10Majavah) [19:36:47] (03CR) 10Majavah: "recheck" [extensions/WikimediaMaintenance] (wmf/1.39.0-wmf.12) - 
https://gerrit.wikimedia.org/r/793800 (https://phabricator.wikimedia.org/T308895) (owner: Majavah) [19:37:23] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [19:37:43] PROBLEM - SSH on wtp1046.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:41:09] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [19:41:52] SRE, Traffic, Wikimedia-Incident: All wikis down: error 503 - https://phabricator.wikimedia.org/T308940 (AlexisJazz) [19:43:09] SRE, Wikimedia-Site-requests, Performance-Team (Radar): Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319 (Fuzzy) Why not use `$wgMaxArticleSize` as a limit for the page raw size, and `2*$wgMaxArticleSize` as limit for the page post-expand include si... 
[19:43:55] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: (C)60 le (W)70 le 98.28 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [19:43:55] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 99.57 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [19:44:24] (PS2) Krinkle: Remove meaningless restriction level "none" [mediawiki-config] - https://gerrit.wikimedia.org/r/776200 (owner: Thiemo Kreuz (WMDE)) [19:44:33] (CR) jerkins-bot: [V: -1] Remove meaningless restriction level "none" [mediawiki-config] - https://gerrit.wikimedia.org/r/776200 (owner: Thiemo Kreuz (WMDE)) [19:45:29] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [19:46:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [19:50:55] (LogstashIngestSpike) resolved: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [19:52:18] (ProbeDown) firing: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:57:19] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:57:25] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:59:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:02:49] (03CR) 10Zabe: [C: 03+1] Add a script to fix T308895 renames 
[extensions/WikimediaMaintenance] (wmf/1.39.0-wmf.12) - https://gerrit.wikimedia.org/r/793800 (https://phabricator.wikimedia.org/T308895) (owner: Majavah) [20:03:25] SRE: Klaxon redirects to http://klaxon.wikimedia.org (not https) - https://phabricator.wikimedia.org/T308941 (Legoktm) a:Legoktm [20:04:19] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:06:40] SRE, Traffic, Wikimedia-Incident: All wikis down: error 503 - https://phabricator.wikimedia.org/T308940 (Dzahn) p:Triage→High [20:06:57] PROBLEM - Check systemd state on ms-be1066 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:07:38] SRE, Traffic, Wikimedia-Incident: All wikis down: error 503 - https://phabricator.wikimedia.org/T308940 (Dzahn) Status High but not UBN anymore. We will follow up with an incident report, but there is currently no ongoing outage. 
[20:08:35] PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:09:04] 10SRE, 10Traffic, 10Wikimedia-Incident: All wikis down: error 503 (resolved, follow-up pending) - https://phabricator.wikimedia.org/T308940 (10Dzahn) [20:10:49] RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 6.212 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:11:18] (ProbeDown) firing: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:12:41] (03PS1) 10Legoktm: Use ProxyFix middleware to correctly recognize HTTPS usage [software/klaxon] - 10https://gerrit.wikimedia.org/r/794759 (https://phabricator.wikimedia.org/T308941) [20:12:47] (03CR) 10jerkins-bot: [V: 04-1] Use ProxyFix middleware to correctly recognize HTTPS usage [software/klaxon] - 10https://gerrit.wikimedia.org/r/794759 (https://phabricator.wikimedia.org/T308941) (owner: 10Legoktm) [20:16:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:23:53] PROBLEM - PHP7 jobrunner on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [20:26:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:27:43] PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:28:47] PROBLEM - PHP7 rendering on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:29:40] 10SRE, 10Abstract Wikipedia team (Phase λ – Launch), 10Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (10Legoktm) [20:29:55] RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 4.848 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:30:35] RECOVERY - PHP7 jobrunner on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.036 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [20:31:03] RECOVERY - PHP7 rendering on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 9.583 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:31:18] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:33:03] PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [20:33:18] (ProbeDown) firing: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:34:14] 10SRE, 10Abstract Wikipedia team (Phase λ – Launch), 10Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (10Legoktm) I copied the deployment checklist that I used for shellbox (T281423) and pas... [20:35:13] RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 2.547 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [20:35:19] PROBLEM - PHP7 jobrunner on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [20:35:43] 10SRE, 10Sustainability (Incident Followup): get a legend for haproxy "anomalous session termination states" - https://phabricator.wikimedia.org/T308952 (10Dzahn) [20:37:29] RECOVERY - PHP7 jobrunner on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 3.706 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [20:37:30] 10SRE, 10Sustainability (Incident Followup): get a legend for haproxy "anomalous session termination states" - https://phabricator.wikimedia.org/T308952 (10Dzahn) [20:38:18] (ProbeDown) resolved: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:38:55] PROBLEM - PHP7 rendering on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:40:19] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:40:50] still here, looking [20:41:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:42:09] PROBLEM - PHP7 jobrunner on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [20:43:19] RECOVERY - PHP7 rendering on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.027 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:44:25] PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [20:45:19] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:46:18] (ProbeDown) firing: (3) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:46:39] RECOVERY - PHP7 jobrunner on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 6.333 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [20:46:39] RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 7.498 second response time https://wikitech.wikimedia.org/wiki/Jobrunner 
[20:50:09] 10SRE, 10Traffic, 10Wikimedia-Incident: All wikis down: error 503 (resolved, follow-up pending) - https://phabricator.wikimedia.org/T308940 (10Dzahn) This was a failure at the edge / caching layer. All services behind it were not directly affected but appeared down / received no traffic. beta cluster was not... [20:51:19] (ProbeDown) resolved: (3) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:53:29] PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:53:49] (ProbeDown) firing: (2) Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:55:41] RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 5.550 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:58:48] (ProbeDown) resolved: (2) Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:01:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:01:29] RECOVERY - Check systemd state on ms-be1066 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:04:40] SRE, Patch-For-Review, Sustainability (Incident Followup): Klaxon redirects to http://klaxon.wikimedia.org (not https) - https://phabricator.wikimedia.org/T308941 (Dzahn) [21:06:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:10:32] SRE, Performance-Team, Wikimedia-Site-requests, Russian-Sites: Increase $wgMaxArticleSize to 4MB for ruwikisource - https://phabricator.wikimedia.org/T308893 (Urbanecm) Tagging with the same tags as {T275319}. This will require approval from #performance-team at least. 
[21:16:18] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:19:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:23:09] (03PS9) 10MdsShakil: Remove patrol rights from autoconfirmed users and create patroller user group on bnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793790 (https://phabricator.wikimedia.org/T308945) [21:23:37] (03PS10) 10Samtar: Remove patrol rights from autoconfirmed users and create patroller user group on bnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793790 (https://phabricator.wikimedia.org/T308945) (owner: 10MdsShakil) [21:23:45] (03CR) 10jerkins-bot: [V: 04-1] Remove patrol rights from autoconfirmed users and create patroller user group on bnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793790 (https://phabricator.wikimedia.org/T308945) (owner: 10MdsShakil) [21:29:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:31:18] 10SRE, 10Wikimedia-Site-requests, 10Performance-Team (Radar): Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319 (10Alexey_Skripnik) > More memory consumed everytime it is interacted with by the software. 
Isn't it cached after being generated once? In Wikisour... [21:34:18] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:36:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:36:40] (03PS11) 10MdsShakil: Remove patrol rights from autoconfirmed users and create patroller user group on bnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793790 (https://phabricator.wikimedia.org/T308945) [21:40:03] RECOVERY - SSH on wtp1046.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:41:19] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:42:39] Can anyone please check this patch? Bot called Merge Failed, that was confusing for me. 
[21:42:41] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/793790/ [21:43:19] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:43:35] MdsShakil: it's a known problem at the moment [21:43:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T298555)', diff saved to https://phabricator.wikimedia.org/P28242 and previous config saved to /var/cache/conftool/dbconfig/20220521-214338-ladsgroup.json [21:43:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1141.eqiad.wmnet with reason: Maintenance [21:43:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1141.eqiad.wmnet with reason: Maintenance [21:43:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:45] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [21:43:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1141 (T298555)', diff saved to https://phabricator.wikimedia.org/P28243 and previous config saved to /var/cache/conftool/dbconfig/20220521-214346-ladsgroup.json [21:43:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:50] MdsShakil: https://phabricator.wikimedia.org/T308943 [21:43:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:49] @p858snak oh! 
I was not aware of this issue, thanks [21:47:15] SRE, Wikimedia-Site-requests, Performance-Team (Radar): Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319 (Alexey_Skripnik) Another consideration: the alternative to a page with 4 MB of text is not a page with 1 MB of text, but rather 4 pages with 1 MB... [21:48:19] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:48:55] PROBLEM - PHP7 rendering on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:50:29] PROBLEM - PHP7 jobrunner on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [21:53:19] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:53:21] RECOVERY - PHP7 rendering on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 7.608 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:54:53] RECOVERY - PHP7 jobrunner on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 7.823 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [21:57:05] PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [21:59:11] RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.021 second response time 
https://wikitech.wikimedia.org/wiki/Jobrunner [22:06:35] PROBLEM - Gerrit Health Check SSL Expiry on gerrit.wikimedia.org is CRITICAL: CRITICAL - Certificate gerrit.wikimedia.org expires in 6 day(s) (Sat 28 May 2022 08:33:22 PM GMT +0000). https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [22:10:51] !log Restarted Zuul CI server due to stale SSH connections which counted against the max per-user connection limit in Gerrit # T308943 [22:10:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:57] T308943: CI fails with 'This change or one of its cross-repo dependencies was unable to be automatically merged' for a lot of repos - https://phabricator.wikimedia.org/T308943 [22:11:05] RECOVERY - Gerrit Health Check SSL Expiry on gerrit.wikimedia.org is OK: OK - Certificate gerrit.wikimedia.org will expire on Wed 27 Jul 2022 08:27:52 PM GMT +0000. https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [22:13:19] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:17:58] (CR) Stang: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/793790 (https://phabricator.wikimedia.org/T308945) (owner: MdsShakil) [22:18:19] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:24:19] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:25:56] (03CR) 10Legoktm: "recheck" [software/klaxon] - 10https://gerrit.wikimedia.org/r/794759 (https://phabricator.wikimedia.org/T308941) (owner: 10Legoktm) [22:29:18] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:40:28] (03CR) 10AntiCompositeNumber: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776200 (owner: 10Thiemo Kreuz (WMDE)) [22:52:29] (03CR) 10AntiCompositeNumber: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793799 (https://phabricator.wikimedia.org/T287887) (owner: 10Abijeet Patro) [23:11:00] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:35:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T298560)', diff saved to https://phabricator.wikimedia.org/P28244 and previous config saved to /var/cache/conftool/dbconfig/20220521-233556-ladsgroup.json [23:36:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:36:03] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [23:51:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P28245 and previous config saved to /var/cache/conftool/dbconfig/20220521-235102-ladsgroup.json [23:51:04] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log