[00:00:30] PROBLEM - PHP7 jobrunner on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [00:00:58] PROBLEM - PHP7 rendering on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:02:46] RECOVERY - PHP7 jobrunner on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 4.367 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [00:03:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:06:00] (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:07:33] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:07:58] RECOVERY - PHP7 rendering on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 9.541 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:08:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P26234 and previous config saved to /var/cache/conftool/dbconfig/20220423-000839-ladsgroup.json [00:08:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:10:58] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T298565)', diff saved to https://phabricator.wikimedia.org/P26235 and previous config saved to /var/cache/conftool/dbconfig/20220423-001058-ladsgroup.json [00:11:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:11:03] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [00:16:46] PROBLEM - PHP7 jobrunner on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [00:19:42] RECOVERY - Check systemd state on doc1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:21:18] RECOVERY - PHP7 jobrunner on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 5.099 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [00:23:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:23:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P26236 and previous config saved to /var/cache/conftool/dbconfig/20220423-002344-ladsgroup.json [00:23:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [00:23:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [00:23:48] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:23:49] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [00:23:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:23:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P26237 and previous config saved to /var/cache/conftool/dbconfig/20220423-002352-ladsgroup.json [00:23:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:23:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:26:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P26238 and previous config saved to /var/cache/conftool/dbconfig/20220423-002603-ladsgroup.json [00:26:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:26:52] PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:27:33] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:29:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:29:26] PROBLEM - PHP7 rendering on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:31:18] RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:31:38] RECOVERY - PHP7 rendering on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 2.647 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:35:58] PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [00:37:06] PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:37:22] PROBLEM - PHP7 rendering on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:38:06] RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [00:39:14] RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.027 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:39:18] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:39:40] RECOVERY - PHP7 rendering on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 7.892 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:41:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P26239 and previous config saved to /var/cache/conftool/dbconfig/20220423-004108-ladsgroup.json [00:41:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:42:00] PROBLEM - Check systemd state on mw1438 is CRITICAL: CRITICAL - degraded: The following units failed: php7.2-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:42:18] (ProbeDown) firing: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:42:50] PROBLEM - PHP7 rendering on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:44:58] RECOVERY - PHP7 rendering on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.099 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:46:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [00:46:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [00:46:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:46:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:46:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T306560)', diff saved to https://phabricator.wikimedia.org/P26240 and previous config saved to 
/var/cache/conftool/dbconfig/20220423-004617-ladsgroup.json [00:46:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:46:22] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [00:47:18] PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:47:18] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:47:28] PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [00:49:08] PROBLEM - PHP7 jobrunner on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [00:49:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T306560)', diff saved to https://phabricator.wikimedia.org/P26241 and previous config saved to /var/cache/conftool/dbconfig/20220423-004935-ladsgroup.json [00:49:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:51:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:52:04] RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 5.787 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [00:56:02] RECOVERY - PHP7 jobrunner on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 
0.005 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [00:56:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T298565)', diff saved to https://phabricator.wikimedia.org/P26242 and previous config saved to /var/cache/conftool/dbconfig/20220423-005613-ladsgroup.json [00:56:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:56:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [00:56:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:56:19] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [00:56:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [00:56:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 10 hosts with reason: Maintenance [00:56:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:56:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:56:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:56:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 10 hosts with reason: Maintenance [00:56:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:56:52] PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL 
- Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [00:59:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [00:59:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [00:59:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:59:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:01:18] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:01:48] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:02:44] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [01:03:26] RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.028 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [01:03:38] RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.024 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [01:04:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P26243 and previous config saved to 
/var/cache/conftool/dbconfig/20220423-010440-ladsgroup.json [01:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:05:42] PROBLEM - PHP7 rendering on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [01:06:33] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:06:48] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:07:58] RECOVERY - PHP7 rendering on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 7.102 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [01:10:06] PROBLEM - PHP7 jobrunner on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [01:11:33] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:12:30] PROBLEM - PHP7 rendering on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [01:13:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:14:32] RECOVERY - PHP7 jobrunner on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.027 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [01:14:42] PROBLEM - PHP7 jobrunner on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [01:16:58] RECOVERY - PHP7 rendering on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [01:16:58] RECOVERY - PHP7 jobrunner on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 6.207 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [01:18:18] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:19:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P26244 and previous config saved to /var/cache/conftool/dbconfig/20220423-011945-ladsgroup.json [01:19:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:24:18] (ProbeDown) firing: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:27:18] PROBLEM - PHP7 rendering on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [01:29:18] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:29:26] RECOVERY - PHP7 rendering on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [01:30:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:30:32] PROBLEM - PHP7 jobrunner on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [01:32:56] PROBLEM - PHP7 jobrunner on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [01:33:28] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [01:33:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P26245 and previous config saved to /var/cache/conftool/dbconfig/20220423-013336-ladsgroup.json [01:33:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:33:44] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, 
user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [01:34:33] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:34:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [01:34:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [01:34:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:34:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:34:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T306560)', diff saved to https://phabricator.wikimedia.org/P26246 and previous config saved to /var/cache/conftool/dbconfig/20220423-013450-ladsgroup.json [01:34:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:34:54] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [01:35:02] RECOVERY - PHP7 jobrunner on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [01:35:08] RECOVERY - PHP7 jobrunner on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 2.112 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [01:39:33] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:39:56] PROBLEM 
- PHP7 jobrunner on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [01:40:45] (JobUnavailable) firing: (5) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:42:16] RECOVERY - PHP7 jobrunner on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 9.820 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [01:44:33] (ProbeDown) resolved: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:45:12] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 22.4 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [01:45:12] PROBLEM - PHP7 rendering on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [01:45:18] (ProbeDown) firing: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:45:24] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 14.26 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [01:45:36] PROBLEM - Varnish traffic drop 
between 30min ago and now at esams on alert1001 is CRITICAL: 13.11 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [01:45:45] (JobUnavailable) firing: (5) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:46:52] PROBLEM - PHP7 jobrunner on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [01:47:22] PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [01:47:24] RECOVERY - PHP7 rendering on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 1.985 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [01:47:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [01:48:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P26247 and previous config saved to /var/cache/conftool/dbconfig/20220423-014841-ladsgroup.json [01:48:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:48:46] PROBLEM - PHP7 rendering on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [01:49:00] RECOVERY - PHP7 jobrunner on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.025 second response time https://wikitech.wikimedia.org/wiki/Jobrunner 
[01:49:33] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:49:48] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 100.9 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [01:50:00] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [01:50:12] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [01:51:24] PROBLEM - PHP7 jobrunner on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [01:53:16] RECOVERY - PHP7 rendering on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.968 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [01:53:30] PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [01:53:36] RECOVERY - PHP7 jobrunner on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 2.230 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [01:53:46] PROBLEM - PHP7 jobrunner on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [01:55:56] RECOVERY - PHP7 jobrunner on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 2.437 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [01:56:30] RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 6.236 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [01:58:20] PROBLEM - PHP7 jobrunner on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [01:59:33] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:00:02] PROBLEM - PHP7 rendering on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering 
[02:00:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:00:32] RECOVERY - PHP7 jobrunner on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 3.359 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [02:00:40] PROBLEM - PHP7 jobrunner on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [02:01:12] PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [02:03:22] RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [02:03:34] PROBLEM - PHP7 rendering on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:03:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P26248 and previous config saved to /var/cache/conftool/dbconfig/20220423-020346-ladsgroup.json [02:03:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:05:10] RECOVERY - PHP7 jobrunner on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.030 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [02:05:18] PROBLEM - PHP7 jobrunner on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [02:07:00] RECOVERY - PHP7 rendering on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 9.457 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:07:34] 
RECOVERY - PHP7 jobrunner on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 6.642 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [02:08:04] RECOVERY - PHP7 rendering on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:08:10] PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [02:09:34] RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.029 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:12:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [02:12:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [02:12:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:12:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:12:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P26249 and previous config saved to /var/cache/conftool/dbconfig/20220423-021211-ladsgroup.json [02:12:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:12:16] PROBLEM - PHP7 jobrunner on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [02:12:16] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [02:12:40] RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP 
OK: HTTP/1.1 200 OK - 324 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [02:14:28] RECOVERY - PHP7 jobrunner on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 1.143 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [02:14:33] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:16:10] PROBLEM - PHP7 rendering on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:18:18] RECOVERY - PHP7 rendering on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.032 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:18:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1149.eqiad.wmnet with reason: Maintenance [02:18:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1149.eqiad.wmnet with reason: Maintenance [02:18:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:18:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:18:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1149 (T306560)', diff saved to https://phabricator.wikimedia.org/P26250 and previous config saved to /var/cache/conftool/dbconfig/20220423-021826-ladsgroup.json [02:18:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:18:30] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [02:18:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T298565)', diff saved to 
https://phabricator.wikimedia.org/P26251 and previous config saved to /var/cache/conftool/dbconfig/20220423-021851-ladsgroup.json [02:18:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:18:57] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [02:19:33] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:20:18] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:21:48] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:23:52] PROBLEM - PHP7 jobrunner on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [02:24:22] PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [02:24:33] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:25:36] PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:26:34] RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 1.474 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [02:26:48] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:27:46] RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:28:30] PROBLEM - PHP7 jobrunner on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [02:29:33] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:29:52] PROBLEM - PHP7 rendering on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:30:40] RECOVERY - PHP7 jobrunner on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.135 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [02:30:42] RECOVERY - PHP7 jobrunner on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.462 second response time 
https://wikitech.wikimedia.org/wiki/Jobrunner [02:32:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [02:34:26] RECOVERY - PHP7 rendering on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 2.422 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:35:32] PROBLEM - PHP7 jobrunner on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [02:36:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:37:40] RECOVERY - PHP7 jobrunner on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [02:40:08] PROBLEM - PHP7 jobrunner on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [02:40:10] PROBLEM - PHP7 rendering on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:41:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:42:28] PROBLEM - PHP7 jobrunner on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner 
[02:43:46] PROBLEM - LVS jobrunner eqiad port 443/tcp - JobRunner LVS interface -https-. jobrunner.svc.eqiad.wmnet IPv4 #page on jobrunner.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [02:44:35] hey, looking [02:44:40] PROBLEM - PHP7 rendering on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:44:46] RECOVERY - PHP7 jobrunner on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 8.518 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [02:45:40] here as well [02:45:56] RECOVERY - LVS jobrunner eqiad port 443/tcp - JobRunner LVS interface -https-. jobrunner.svc.eqiad.wmnet IPv4 #page on jobrunner.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 400 bytes in 1.234 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [02:46:12] I think this is requeueing from https://phabricator.wikimedia.org/T306697 but I don't have a ton of context, trying to catch up [02:46:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:46:31] rzl: nod, looking at that task [02:47:11] Around but it recovered? 
[02:47:46] it did but it's been flapping for long enough that I want to make sure it's stable [02:49:01] ffmpeg is still maxing CPU, unsurprisingly [02:49:08] RECOVERY - PHP7 rendering on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:49:10] Sigh [02:49:26] PROBLEM - PHP7 jobrunner on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [02:49:34] the job queue should still be okay though, since we have some jobrunner hosts depooled from videoscaling so other jobs aren't getting starved out [02:50:19] Yeah. The only thing we can do is to wait for it to stabilize I guess [02:50:58] yeah, I wish we didn't fail healthchecks in that state though :/ wonder if there's already a task for it [02:51:14] do we have a way to view the queue depth? [02:51:24] nice if we could leave one core free, or something [02:51:41] There is a grafana dashboard [02:51:42] RECOVERY - PHP7 rendering on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 7.463 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:52:20] rzl: yeah leave on cpu pinned to other tasks would be a good idea [02:52:25] *one [02:52:59] How you can os to do that? [02:53:16] I think making it nicer would help [02:53:43] *can tell [02:53:52] RECOVERY - PHP7 jobrunner on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 1.643 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [02:54:02] RECOVERY - PHP7 jobrunner on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 7.298 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [02:54:55] I go rest though. 
5am here 👋🤦 [02:55:10] Amir1: night night, thanks [02:55:20] Anything needed can wait until Monday [02:56:18] (ProbeDown) resolved: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:56:21] yeah I'm pretty sure there's nothing we need to do here -- I just wish I were more confident the same alert wasn't going to fire again while we churn through the backlog [02:58:01] I don't have much context yet on the internals, but it seems probable that it will fire again [02:58:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:59:04] yeah I mean the IRC alert will definitely keep flapping, it's just a matter of whether we're lucky enough it doesn't page [02:59:38] Can we disable paging until Monday? [02:59:40] PROBLEM - LVS videoscaler eqiad port 443/tcp - Videoscaler LVS interface -https-. 
videoscaler.svc.eqiad.wmnet IPv4 #page on videoscaler.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [02:59:47] ^ haha [02:59:58] unfortunately not without leaving ourselves blind to a real problem [03:00:06] Case in point ^ [03:00:16] Sigh [03:00:20] OK then [03:00:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P26252 and previous config saved to /var/cache/conftool/dbconfig/20220423-030035-ladsgroup.json [03:00:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:00:42] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [03:00:46] trying to figure out if there's a good mitigation here [03:01:25] Maybe pool more mw hosts for video scalers? [03:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [03:02:00] (Sorry can't sleep 😕) [03:02:04] so we get out of this state faster, you mean? yeah, although then I'd worry about making progress on other jobs [03:02:18] PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:02:52] PROBLEM - PHP7 rendering on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:03:34] If it's possible to take out from appservers. 
That'd be nice [03:04:16] RECOVERY - LVS videoscaler eqiad port 443/tcp - Videoscaler LVS interface -https-. videoscaler.svc.eqiad.wmnet IPv4 #page on videoscaler.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 402 bytes in 6.526 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [03:04:28] RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.032 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:04:35] hm, conceivably? we'd need to depool them from normal traffic, I wouldn't want them to contend with ffmpeg for CPUs [03:05:01] they're different puppet roles and everything though, I'm not sure if appservers even have ffmpeg installed [03:05:13] Ugh. OK then [03:05:16] we'd have to reimage and everything, I don't really want to do that mid-incident [03:05:24] (or mid-Friday-evening if I'm honest) [03:05:52] it's a good thought though [03:06:04] Yeah. I thought they are similar [03:06:43] I'm also not sure if it would make a big enough dent - I think we're about 11 hours from getting through the video backlog at our current rate, so we'd need to commandeer a lot of machines if we wanted to get it done before EU morning for example [03:07:41] (https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-job=webVideoTranscode) [03:07:57] oh sorry yeah, thought I linked that earlier but I got distracted mid-thought [03:08:13] 10 hours of backlog [03:08:18] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:08:30] I'm also second-guessing myself about downtiming just the LVS alert for videoscaler.svc.eqiad.wmnet, so it doesn't page [03:09:00] Try it! 
Try it! [03:09:18] (ProbeDown) firing: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:09:25] it's definitely not ideal, but, if it fires again in 11 hours we can take a look -- and we know those machines are going to be in bad shape until then [03:09:37] jhathaway: if you're still around, any opinions? [03:09:57] I think that makes sense as a short-term bandaid, and we can look to pin a cpu in the longer term [03:10:57] sounds good, downtiming until 14 UTC -- I'll leave a note in the other channel as well, in case of questions [03:11:09] thanks [03:11:15] Thanks rzl [03:12:08] RECOVERY - PHP7 rendering on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 9.298 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:13:33] (ProbeDown) resolved: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:15:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P26253 and previous config saved to /var/cache/conftool/dbconfig/20220423-031540-ladsgroup.json [03:15:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:16:08] done 👍 thanks Amir1 and jhathaway <3 I'm checking back out, hope you can get some sleep Amir [03:16:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:16:51] 😊 [03:21:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:24:10] PROBLEM - PHP7 jobrunner on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [03:26:26] RECOVERY - PHP7 jobrunner on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 5.844 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [03:30:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P26254 and previous config saved to /var/cache/conftool/dbconfig/20220423-033045-ladsgroup.json [03:30:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:31:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:34:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T306560)', diff saved to https://phabricator.wikimedia.org/P26255 and previous config saved to /var/cache/conftool/dbconfig/20220423-033438-ladsgroup.json [03:34:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:34:43] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [03:36:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed 
probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:38:02] PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:39:42] PROBLEM - PHP7 rendering on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:40:18] RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 7.091 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:41:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:41:52] RECOVERY - PHP7 rendering on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.020 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:45:28] PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [03:45:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P26256 and previous config saved to /var/cache/conftool/dbconfig/20220423-034550-ladsgroup.json [03:45:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [03:45:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on 
db1170.eqiad.wmnet with reason: Maintenance [03:45:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:45:55] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [03:45:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:45:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P26257 and previous config saved to /var/cache/conftool/dbconfig/20220423-034558-ladsgroup.json [03:45:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:46:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:47:42] RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 3.325 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [03:49:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P26258 and previous config saved to /var/cache/conftool/dbconfig/20220423-034943-ladsgroup.json [03:49:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:51:18] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:52:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:53:54] PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:54:36] PROBLEM - PHP7 rendering on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:54:40] PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [03:56:02] RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.443 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:56:33] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:56:50] RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.959 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [03:57:18] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:59:10] RECOVERY - PHP7 rendering on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 3.629 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [03:59:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestagemaster1001:9105 is missing 
kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [04:00:18] (ProbeDown) firing: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:01:20] PROBLEM - PHP7 jobrunner on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [04:01:33] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:03:40] RECOVERY - PHP7 jobrunner on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 8.824 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [04:04:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P26259 and previous config saved to /var/cache/conftool/dbconfig/20220423-040448-ladsgroup.json [04:04:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:05:18] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:06:48] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:17:32] PROBLEM - PHP7 rendering on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [04:19:42] RECOVERY - PHP7 rendering on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.365 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [04:19:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T306560)', diff saved to https://phabricator.wikimedia.org/P26260 and previous config saved to /var/cache/conftool/dbconfig/20220423-041953-ladsgroup.json [04:19:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1148.eqiad.wmnet with reason: Maintenance [04:19:56] PROBLEM - PHP7 jobrunner on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [04:19:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1148.eqiad.wmnet with reason: Maintenance [04:19:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:20:00] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [04:20:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:20:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1148 (T306560)', diff saved to https://phabricator.wikimedia.org/P26261 and previous config saved to /var/cache/conftool/dbconfig/20220423-042001-ladsgroup.json [04:20:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:20:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:21:33] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes 
(http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:22:12] RECOVERY - PHP7 jobrunner on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 5.877 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [04:25:26] PROBLEM - PHP7 rendering on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [04:26:33] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:27:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P26262 and previous config saved to /var/cache/conftool/dbconfig/20220423-042704-ladsgroup.json [04:27:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:27:10] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [04:30:00] RECOVERY - PHP7 rendering on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 4.071 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [04:31:33] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:31:54] PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [04:33:02] PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [04:35:14] RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 1.804 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [04:36:14] PROBLEM - PHP7 jobrunner on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [04:36:26] RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.025 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [04:36:33] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:40:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:42:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P26263 and previous config saved to /var/cache/conftool/dbconfig/20220423-044209-ladsgroup.json [04:42:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[04:43:40] PROBLEM - PHP7 rendering on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [04:45:18] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:45:54] PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [04:47:18] (ProbeDown) firing: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:48:02] PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [04:49:44] PROBLEM - PHP7 jobrunner on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [04:50:18] RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 4.077 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [04:50:24] RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.031 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [04:51:46] PROBLEM - PHP7 rendering on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [04:51:54] RECOVERY - PHP7 jobrunner on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.022 second response time 
https://wikitech.wikimedia.org/wiki/Jobrunner [04:52:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:55:12] PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [04:55:30] PROBLEM - PHP7 jobrunner on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [04:56:42] PROBLEM - PHP7 jobrunner on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [04:57:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P26264 and previous config saved to /var/cache/conftool/dbconfig/20220423-045714-ladsgroup.json [04:57:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:57:18] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:57:38] RECOVERY - PHP7 jobrunner on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.792 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [05:01:18] PROBLEM - PHP7 rendering on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:01:40] PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:02:44] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [05:03:30] RECOVERY - PHP7 rendering on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.026 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:04:18] (ProbeDown) firing: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:06:28] RECOVERY - PHP7 jobrunner on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [05:06:50] RECOVERY - PHP7 rendering on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:09:18] PROBLEM - PHP7 rendering on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:09:18] (ProbeDown) resolved: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:09:30] PROBLEM - PHP7 jobrunner on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [05:11:18] (ProbeDown) firing: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:11:44] RECOVERY - PHP7 jobrunner on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 4.557 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [05:12:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P26265 and previous config saved to /var/cache/conftool/dbconfig/20220423-051219-ladsgroup.json [05:12:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [05:12:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [05:12:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:12:25] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [05:12:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:12:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:13:56] RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 9.663 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [05:15:02] RECOVERY - PHP7 rendering on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:15:18] RECOVERY - PHP7 jobrunner on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.004 second response time 
https://wikitech.wikimedia.org/wiki/Jobrunner [05:15:36] RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.028 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:16:18] (ProbeDown) resolved: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:16:34] PROBLEM - PHP7 jobrunner on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [05:17:22] PROBLEM - PHP7 rendering on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:18:44] RECOVERY - PHP7 jobrunner on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.839 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [05:20:40] PROBLEM - PHP7 jobrunner on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [05:22:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:22:48] RECOVERY - PHP7 jobrunner on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 4.841 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [05:23:04] PROBLEM - PHP7 rendering on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:25:28] RECOVERY - PHP7 rendering on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 8.256 
second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:26:54] PROBLEM - PHP7 jobrunner on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [05:27:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:28:12] PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:28:38] RECOVERY - PHP7 rendering on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:28:52] PROBLEM - PHP7 jobrunner on mw1440 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [05:30:24] RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 2.242 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:31:10] RECOVERY - PHP7 jobrunner on mw1440 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 6.811 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [05:32:06] PROBLEM - PHP7 jobrunner on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [05:32:08] RECOVERY - PHP7 rendering on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:32:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:32:42] PROBLEM - PHP7 jobrunner on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [05:34:28] PROBLEM - PHP7 rendering on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:35:56] PROBLEM - PHP7 jobrunner on mw1440 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [05:36:52] PROBLEM - PHP7 rendering on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:38:00] PROBLEM - PHP7 rendering on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:38:42] PROBLEM - PHP7 rendering on mw1440 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:39:04] RECOVERY - PHP7 rendering on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.029 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:39:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T306560)', diff saved to https://phabricator.wikimedia.org/P26266 and previous config saved to /var/cache/conftool/dbconfig/20220423-053940-ladsgroup.json [05:39:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:39:45] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [05:40:38] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - videoscaler_443: Servers mw1338.eqiad.wmnet are marked 
down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:41:22] RECOVERY - PHP7 jobrunner on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [05:41:24] RECOVERY - PHP7 rendering on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:42:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:43:00] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [05:44:23] RECOVERY - PHP7 jobrunner on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 5.245 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [05:45:18] RECOVERY - PHP7 jobrunner on mw1440 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 7.940 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [05:46:00] (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:46:16] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - videoscaler_443: Servers mw1338.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:47:14] PROBLEM - PHP7 rendering on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:47:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 
- https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [05:49:10] PROBLEM - PHP7 jobrunner on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [05:49:36] RECOVERY - PHP7 rendering on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:50:04] PROBLEM - PHP7 jobrunner on mw1440 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [05:50:14] RECOVERY - PHP7 jobrunner on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [05:50:56] PROBLEM - PHP7 jobrunner on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [05:51:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [05:51:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [05:51:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:51:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:51:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T298565)', diff saved to https://phabricator.wikimedia.org/P26267 and previous config saved to /var/cache/conftool/dbconfig/20220423-055118-ladsgroup.json [05:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:51:22] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, 
user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [05:52:58] PROBLEM - PHP7 rendering on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:53:34] PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [05:54:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P26268 and previous config saved to /var/cache/conftool/dbconfig/20220423-055445-ladsgroup.json [05:54:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:55:04] PROBLEM - PHP7 jobrunner on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [05:55:26] PROBLEM - PHP7 rendering on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:55:36] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [05:57:18] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:57:48] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:58:06] RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 
4.316 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [06:02:33] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:02:34] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - videoscaler_443: Servers mw1338.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:02:48] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:02:52] PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [06:03:58] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - videoscaler_443: Servers mw1438.eqiad.wmnet, mw1338.eqiad.wmnet, mw1308.eqiad.wmnet, mw1445.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:04:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:04:56] PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [06:05:50] PROBLEM - PHP7 jobrunner on mw1338 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Jobrunner [06:06:32] RECOVERY - PHP7 jobrunner on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [06:06:54] RECOVERY - PHP7 rendering on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [06:08:04] RECOVERY - PHP7 jobrunner on mw1338 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 4.368 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [06:08:32] PROBLEM - PHP7 jobrunner on mw1308 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [06:09:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:09:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P26269 and previous config saved to /var/cache/conftool/dbconfig/20220423-060950-ladsgroup.json [06:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:12:33] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:14:08] PROBLEM - PHP7 rendering on mw1308 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [06:17:32] PROBLEM - PHP7 jobrunner on mw1338 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Jobrunner [06:18:28] PROBLEM - PHP7 jobrunner on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [06:18:40] PROBLEM - PHP7 rendering on mw1338 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [06:19:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:19:48] RECOVERY - PHP7 jobrunner on mw1338 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 5.094 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [06:19:48] PROBLEM - PHP7 rendering on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [06:20:48] RECOVERY - PHP7 rendering on mw1338 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [06:22:12] RECOVERY - PHP7 rendering on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.027 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [06:22:33] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:23:22] PROBLEM - PHP7 rendering on mw1445 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [06:23:32] RECOVERY - PHP7 
jobrunner on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [06:24:36] PROBLEM - PHP7 jobrunner on mw1338 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [06:24:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T306560)', diff saved to https://phabricator.wikimedia.org/P26270 and previous config saved to /var/cache/conftool/dbconfig/20220423-062455-ladsgroup.json [06:24:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1147.eqiad.wmnet with reason: Maintenance [06:24:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1147.eqiad.wmnet with reason: Maintenance [06:24:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:25:01] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [06:25:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:25:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1147 (T306560)', diff saved to https://phabricator.wikimedia.org/P26271 and previous config saved to /var/cache/conftool/dbconfig/20220423-062503-ladsgroup.json [06:25:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:25:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:25:34] RECOVERY - PHP7 rendering on mw1445 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [06:26:46] RECOVERY - PHP7 jobrunner on mw1338 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [06:27:58] RECOVERY - PHP7 rendering on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.022 second response 
time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [06:28:50] RECOVERY - PHP7 jobrunner on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.024 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [06:31:14] PROBLEM - PHP7 rendering on mw1338 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [06:31:16] PROBLEM - PHP7 rendering on mw1445 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [06:32:33] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:32:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:35:10] PROBLEM - PHP7 jobrunner on mw1445 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [06:35:28] RECOVERY - PHP7 rendering on mw1338 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [06:36:18] RECOVERY - PHP7 jobrunner on mw1440 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [06:36:36] RECOVERY - PHP7 jobrunner on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [06:36:42] RECOVERY - PHP7 rendering on mw1440 is OK: HTTP OK: HTTP/1.1 
200 OK - 326 bytes in 1.905 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [06:37:08] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [06:37:33] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:38:00] PROBLEM - PHP7 rendering on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [06:39:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:40:18] PROBLEM - PHP7 jobrunner on mw1338 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [06:42:18] RECOVERY - PHP7 rendering on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 6.083 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [06:43:04] PROBLEM - PHP7 jobrunner on mw1440 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [06:43:44] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - videoscaler_443: Servers mw1446.eqiad.wmnet, mw1440.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:43:52] RECOVERY - PHP7 jobrunner on mw1445 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.018 second response time 
https://wikitech.wikimedia.org/wiki/Jobrunner [06:44:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:45:32] PROBLEM - PHP7 jobrunner on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [06:47:40] PROBLEM - PHP7 rendering on mw1440 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [06:48:24] PROBLEM - PHP7 jobrunner on mw1445 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [06:48:48] PROBLEM - PHP7 jobrunner on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [06:50:00] PROBLEM - PHP7 rendering on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [06:50:28] PROBLEM - PHP7 jobrunner on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [06:50:58] RECOVERY - PHP7 jobrunner on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 6.931 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [06:51:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T298565)', diff saved to https://phabricator.wikimedia.org/P26272 and previous config saved to /var/cache/conftool/dbconfig/20220423-065133-ladsgroup.json [06:51:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:51:39] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, 
user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [06:51:48] PROBLEM - PHP7 rendering on mw1338 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [06:53:32] PROBLEM - PHP7 rendering on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [06:54:22] RECOVERY - PHP7 rendering on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.027 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [06:55:38] PROBLEM - PHP7 jobrunner on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [06:57:48] RECOVERY - PHP7 jobrunner on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 4.586 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220423T0700) [07:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [07:04:30] PROBLEM - PHP7 jobrunner on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [07:05:26] PROBLEM - PHP7 rendering on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:06:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P26273 and previous config saved to /var/cache/conftool/dbconfig/20220423-070638-ladsgroup.json [07:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:44] RECOVERY - PHP7 jobrunner on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 8.878 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [07:07:30] RECOVERY - PHP7 rendering on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.026 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:13:06] PROBLEM - PHP7 rendering on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:15:12] RECOVERY - PHP7 rendering on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.024 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:17:56] PROBLEM - PHP7 jobrunner on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [07:20:02] RECOVERY - PHP7 jobrunner on mw1437 is OK: HTTP OK: 
HTTP/1.1 200 OK - 324 bytes in 0.018 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [07:21:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P26274 and previous config saved to /var/cache/conftool/dbconfig/20220423-072143-ladsgroup.json [07:21:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:02] PROBLEM - PHP7 rendering on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:24:42] PROBLEM - PHP7 jobrunner on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [07:36:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T298565)', diff saved to https://phabricator.wikimedia.org/P26275 and previous config saved to /var/cache/conftool/dbconfig/20220423-073648-ladsgroup.json [07:36:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [07:36:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [07:36:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:53] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [07:36:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T298565)', diff saved to https://phabricator.wikimedia.org/P26276 and previous config saved to 
/var/cache/conftool/dbconfig/20220423-073656-ladsgroup.json [07:36:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:37:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T306560)', diff saved to https://phabricator.wikimedia.org/P26277 and previous config saved to /var/cache/conftool/dbconfig/20220423-074211-ladsgroup.json [07:42:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:17] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [07:50:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T298565)', diff saved to https://phabricator.wikimedia.org/P26278 and previous config saved to /var/cache/conftool/dbconfig/20220423-075017-ladsgroup.json [07:50:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:21] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [07:57:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P26279 and previous config saved to /var/cache/conftool/dbconfig/20220423-075716-ladsgroup.json [07:57:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestagemaster1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:03:34] RECOVERY - PHP7 jobrunner on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 
0.023 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [08:04:10] RECOVERY - PHP7 rendering on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.024 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:05:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P26280 and previous config saved to /var/cache/conftool/dbconfig/20220423-080522-ladsgroup.json [08:05:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P26281 and previous config saved to /var/cache/conftool/dbconfig/20220423-081221-ladsgroup.json [08:12:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:08] RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.024 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:14:30] RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.025 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [08:20:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P26282 and previous config saved to /var/cache/conftool/dbconfig/20220423-082027-ladsgroup.json [08:20:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:33] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:24:18] (ProbeDown) firing: Service 
videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:27:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T306560)', diff saved to https://phabricator.wikimedia.org/P26283 and previous config saved to /var/cache/conftool/dbconfig/20220423-082726-ladsgroup.json [08:27:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1143.eqiad.wmnet with reason: Maintenance [08:27:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1143.eqiad.wmnet with reason: Maintenance [08:27:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:32] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [08:27:33] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:27:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1143 (T306560)', diff saved to https://phabricator.wikimedia.org/P26284 and previous config saved to /var/cache/conftool/dbconfig/20220423-082735-ladsgroup.json [08:27:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:03] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:35:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T298565)', diff saved to https://phabricator.wikimedia.org/P26285 and previous config saved to /var/cache/conftool/dbconfig/20220423-083532-ladsgroup.json [08:35:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [08:35:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [08:35:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [08:35:37] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [08:35:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [08:35:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T298565)', diff saved to https://phabricator.wikimedia.org/P26286 and previous config saved to /var/cache/conftool/dbconfig/20220423-083545-ladsgroup.json 
[08:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T298565)', diff saved to https://phabricator.wikimedia.org/P26287 and previous config saved to /var/cache/conftool/dbconfig/20220423-084920-ladsgroup.json [08:49:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:25] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [08:54:03] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:55:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:02:26] PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [09:02:44] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:04:10] PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Jobrunner [09:04:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P26288 and previous config saved to /var/cache/conftool/dbconfig/20220423-090425-ladsgroup.json [09:04:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:32] RECOVERY - WDQS SPARQL on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 1.055 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [09:07:00] PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [09:19:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P26289 and previous config saved to /var/cache/conftool/dbconfig/20220423-091930-ladsgroup.json [09:19:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T298565)', diff saved to https://phabricator.wikimedia.org/P26290 and previous config saved to /var/cache/conftool/dbconfig/20220423-093435-ladsgroup.json [09:34:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [09:34:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [09:34:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:41] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 
[09:34:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P26291 and previous config saved to /var/cache/conftool/dbconfig/20220423-093443-ladsgroup.json [09:34:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:56] !log `apt-get clean` on an-airflow1001 to free some space [09:38:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:00] (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:46:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T306560)', diff saved to https://phabricator.wikimedia.org/P26292 and previous config saved to /var/cache/conftool/dbconfig/20220423-094610-ladsgroup.json [09:46:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:15] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [09:47:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:01:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P26293 and previous config saved to /var/cache/conftool/dbconfig/20220423-100115-ladsgroup.json [10:01:19] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P26294 and previous config saved to /var/cache/conftool/dbconfig/20220423-101622-ladsgroup.json [10:16:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P26295 and previous config saved to /var/cache/conftool/dbconfig/20220423-101955-ladsgroup.json [10:19:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:00] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [10:31:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T306560)', diff saved to https://phabricator.wikimedia.org/P26296 and previous config saved to /var/cache/conftool/dbconfig/20220423-103127-ladsgroup.json [10:31:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1142.eqiad.wmnet with reason: Maintenance [10:31:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1142.eqiad.wmnet with reason: Maintenance [10:31:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:32] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [10:31:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1142 (T306560)', diff saved to https://phabricator.wikimedia.org/P26297 and 
previous config saved to /var/cache/conftool/dbconfig/20220423-103135-ladsgroup.json [10:31:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:35:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P26298 and previous config saved to /var/cache/conftool/dbconfig/20220423-103500-ladsgroup.json [10:35:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P26299 and previous config saved to /var/cache/conftool/dbconfig/20220423-105005-ladsgroup.json [10:50:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:40] RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [10:53:18] RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.028 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:00:16] PROBLEM - Maps - OSM synchronization lag - eqiad on alert1001 is CRITICAL: 2.592e+05 ge 2.592e+05 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&viewPanel=11 [11:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - 
https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [11:05:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P26300 and previous config saved to /var/cache/conftool/dbconfig/20220423-110511-ladsgroup.json [11:05:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:17] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [11:05:18] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:06:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:11:52] PROBLEM - Disk space on ml-staging-ctrl2002 is CRITICAL: DISK CRITICAL - free space: / 1115 MB (5% inode=95%): /tmp 1115 MB (5% inode=95%): /var/tmp 1115 MB (5% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-staging-ctrl2002&var-datasource=codfw+prometheus/ops [11:24:44] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:25:10] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:25:33] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:26:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:29:12] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:29:40] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:30:33] (ProbeDown) firing: (3) Service upload-https:443 has failed probes (http_upload-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:31:16] PROBLEM - At least one CPU core of 
an LVS is saturated- packet drops are likely on lvs5002 is CRITICAL: cpu={0,10,12,14,2,4,6,8} https://bit.ly/wmf-lvscpu https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs5002&var-datasource=eqsin+prometheus/ops [11:31:18] (ProbeDown) firing: (3) Service upload-https:443 has failed probes (http_upload-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:32:55] (Primary outbound port utilisation over 80% #page) firing: Alert for device cr2-eqsin.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [11:32:55] (Primary outbound port utilisation over 80% #page) firing: Alert for device cr2-eqsin.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [11:35:33] (ProbeDown) resolved: (3) Service upload-https:443 has failed probes (http_upload-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:35:42] RECOVERY - At least one CPU core of an LVS is saturated- packet drops are likely on lvs5002 is OK: All metrics within thresholds. 
https://bit.ly/wmf-lvscpu https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs5002&var-datasource=eqsin+prometheus/ops [11:36:18] (ProbeDown) firing: (3) Service upload-https:443 has failed probes (http_upload-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:42:55] (Primary outbound port utilisation over 80% #page) resolved: Device cr2-eqsin.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [11:42:55] (Primary outbound port utilisation over 80% #page) resolved: Device cr2-eqsin.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [11:50:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T306560)', diff saved to https://phabricator.wikimedia.org/P26301 and previous config saved to /var/cache/conftool/dbconfig/20220423-115035-ladsgroup.json [11:50:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:41] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [11:51:56] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:54:16] PROBLEM - Disk space on ml-staging-ctrl2002 is CRITICAL: DISK CRITICAL - free space: / 1115 MB (5% inode=95%): /tmp 1115 MB (5% inode=95%): /var/tmp 1115 MB (5% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-staging-ctrl2002&var-datasource=codfw+prometheus/ops [11:55:42] PROBLEM - Check systemd state on mw1338 is CRITICAL: CRITICAL - degraded: The following units failed: php7.2-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:59:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestagemaster1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:01:02] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:05:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P26302 and previous config saved to /var/cache/conftool/dbconfig/20220423-120540-ladsgroup.json [12:05:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:18] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:16:48] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:20:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P26303 and previous config saved to /var/cache/conftool/dbconfig/20220423-122045-ladsgroup.json [12:20:48] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:48] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:22:48] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:35:28] PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [12:35:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T306560)', diff saved to https://phabricator.wikimedia.org/P26304 and previous config saved to /var/cache/conftool/dbconfig/20220423-123550-ladsgroup.json [12:35:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1138.eqiad.wmnet with reason: Maintenance [12:35:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1138.eqiad.wmnet with reason: Maintenance [12:35:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:56] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [12:35:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1138 (T306560)', diff saved to https://phabricator.wikimedia.org/P26305 and previous config saved to /var/cache/conftool/dbconfig/20220423-123558-ladsgroup.json [12:36:01] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:56] PROBLEM - Check systemd state on mw1440 is CRITICAL: CRITICAL - degraded: The following units failed: php7.2-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:38:04] PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:59:46] PROBLEM - PHP7 rendering on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:02:44] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [13:03:48] PROBLEM - PHP7 jobrunner on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [13:04:02] RECOVERY - PHP7 rendering on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.239 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:06:08] RECOVERY - PHP7 jobrunner on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 9.091 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [13:10:33] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:20:26] PROBLEM - Check systemd state on mw1445 is CRITICAL: CRITICAL - degraded: The following units failed: php7.2-fpm_check_restart.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:31:20] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:35:40] PROBLEM - PHP7 jobrunner on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [13:36:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T306560)', diff saved to https://phabricator.wikimedia.org/P26306 and previous config saved to /var/cache/conftool/dbconfig/20220423-133614-ladsgroup.json [13:36:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:21] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [13:36:40] PROBLEM - PHP7 rendering on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:46:00] (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:47:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:51:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P26307 and previous config saved to /var/cache/conftool/dbconfig/20220423-135119-ladsgroup.json [13:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:22] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - 
OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:03:38] RECOVERY - PHP7 rendering on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:04:56] RECOVERY - PHP7 jobrunner on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.026 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [14:06:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P26308 and previous config saved to /var/cache/conftool/dbconfig/20220423-140624-ladsgroup.json [14:06:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T306560)', diff saved to https://phabricator.wikimedia.org/P26309 and previous config saved to /var/cache/conftool/dbconfig/20220423-142129-ladsgroup.json [14:21:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [14:21:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [14:21:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:36] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [14:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - 
https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:36:47] PROBLEM - LVS videoscaler eqiad port 443/tcp - Videoscaler LVS interface -https-. videoscaler.svc.eqiad.wmnet IPv4 #page on videoscaler.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:38:59] RECOVERY - LVS videoscaler eqiad port 443/tcp - Videoscaler LVS interface -https-. videoscaler.svc.eqiad.wmnet IPv4 #page on videoscaler.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 398 bytes in 1.071 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:43:55] ^ strange, afaict from the jobqueue dashboard we did get through the videoscaling backlog as expected, but cpu is still pegged high [14:44:26] PROBLEM - Check systemd state on mw1446 is CRITICAL: CRITICAL - degraded: The following units failed: php7.2-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:45:34] this woke me up so I'm a little sluggish, may be missing something obvious :) still looking but I'll probably re-downtime for another few hours to let it clear [14:46:11] PROBLEM - LVS videoscaler eqiad port 443/tcp - Videoscaler LVS interface -https-. videoscaler.svc.eqiad.wmnet IPv4 #page on videoscaler.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:48:13] (would love a second opinion if anyone else is around) [14:53:17] done [15:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:11:41] RECOVERY - LVS videoscaler eqiad port 443/tcp - Videoscaler LVS interface -https-. 
videoscaler.svc.eqiad.wmnet IPv4 #page on videoscaler.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 402 bytes in 3.700 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [15:20:14] RECOVERY - PHP7 jobrunner on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 3.982 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [15:24:56] PROBLEM - PHP7 jobrunner on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [15:39:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [15:39:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [15:39:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestagemaster1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:16:12] !log akosiaris@cumin1001 conftool action : set/pooled=no; selector: cluster=jobrunner,name=mw1338.eqiad.wmnet [16:16:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:20] !log akosiaris@cumin1001 conftool action : set/pooled=no; selector: cluster=jobrunner,name=mw1437.eqiad.wmnet [16:16:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:27] !log akosiaris@cumin1001 conftool action : set/pooled=no; selector: cluster=jobrunner,name=mw1438.eqiad.wmnet [16:16:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:33] !log akosiaris@cumin1001 conftool action : set/pooled=no; selector: cluster=jobrunner,name=mw1439.eqiad.wmnet [16:16:35] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:43] !log akosiaris@cumin1001 conftool action : set/pooled=no; selector: cluster=jobrunner,name=mw1440.eqiad.wmnet [16:16:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:55] !log akosiaris@cumin1001 conftool action : set/pooled=no; selector: cluster=jobrunner,name=mw1445.eqiad.wmnet [16:16:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:03] !log akosiaris@cumin1001 conftool action : set/pooled=no; selector: cluster=jobrunner,name=mw1446.eqiad.wmnet [16:17:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:11] !log depool the videoscalers from the jobrunner cluster. Effectively split the 2 clusters that way. This should isolate the rest of the jobs from the video transcoding jobs, reducing the latency that they are experiencing [16:18:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:50] !log akosiaris@cumin1001 conftool action : set/weight=8; selector: cluster=jobrunner,name=mw1336.eqiad.wmnet [16:22:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:57] !log akosiaris@cumin1001 conftool action : set/weight=8; selector: cluster=jobrunner,name=mw1335.eqiad.wmnet [16:22:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:19] !log akosiaris@cumin1001 conftool action : set/weight=4; selector: cluster=jobrunner,name=mw1335.eqiad.wmnet [16:24:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:24] !log akosiaris@cumin1001 conftool action : set/weight=4; selector: cluster=jobrunner,name=mw1336.eqiad.wmnet [16:24:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:15] !log increase mw1335 and mw1336 weights on the jobrunner cluster from 1 to 4 (they were at 25% CPU usage). 
That should direct more traffic to them and lighten the load on the rest. [16:25:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:58] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE, AS6939/IPv6: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:30:18] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 92 probes of 675 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:36:22] RECOVERY - PHP7 jobrunner on mw1440 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [16:36:38] RECOVERY - PHP7 rendering on mw1440 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:36:40] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 60 probes of 675 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:42:45] (03CR) 10BryanDavis: [C: 03+2] Add perl532-sssd [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/778683 (https://phabricator.wikimedia.org/T214343) (owner: 10BryanDavis) [16:43:21] (03Merged) 10jenkins-bot: Add perl532-sssd [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/778683 (https://phabricator.wikimedia.org/T214343) (owner: 10BryanDavis) [16:44:03] RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.024 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [16:44:26] RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.021 second response 
time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:45:33] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:47:33] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:47:48] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:52:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:55:50] PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [16:57:14] PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:57:18] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:59:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [16:59:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [16:59:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3314 (T306560)', diff saved to https://phabricator.wikimedia.org/P26310 and previous config saved to /var/cache/conftool/dbconfig/20220423-165939-ladsgroup.json [16:59:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:44] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [17:00:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:02:44] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [17:07:33] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:08:46] RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.020 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:09:44] RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.024 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [17:10:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:12:33] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:13:02] PROBLEM - SSH on furud.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:13:53] (03CR) 10BryanDavis: [C: 03+2] kubernetes: Fix default resource handling [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/783663 (owner: 10Majavah) [17:14:18] PROBLEM - SSH on wtp1035.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:14:47] (03Merged) 10jenkins-bot: kubernetes: Fix default resource handling [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/783663 (owner: 10Majavah) [17:17:10] RECOVERY - PHP7 jobrunner on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 4.809 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [17:17:50] RECOVERY - PHP7 rendering on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.025 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:18:30] RECOVERY - PyBal 
backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:19:20] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:32:17] (03PS1) 10BryanDavis: k8s: add perl5.32 type [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/785374 (https://phabricator.wikimedia.org/T214343) [17:33:12] (03CR) 10jerkins-bot: [V: 04-1] k8s: add perl5.32 type [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/785374 (https://phabricator.wikimedia.org/T214343) (owner: 10BryanDavis) [17:34:22] (03PS2) 10BryanDavis: k8s: add perl5.32 type [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/785374 (https://phabricator.wikimedia.org/T214343) [17:37:51] (03CR) 10BryanDavis: Perform rolling restarts on kubernetes (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/721989 (https://phabricator.wikimedia.org/T290833) (owner: 10Lucas Werkmeister) [17:39:01] (03Abandoned) 10BryanDavis: Man page for webservice [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/437054 (https://phabricator.wikimedia.org/T95097) (owner: 10Nehajha) [17:44:35] (03CR) 10BryanDavis: [C: 03+2] k8s: add perl5.32 type [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/785374 (https://phabricator.wikimedia.org/T214343) (owner: 10BryanDavis) [17:45:30] (03Merged) 10jenkins-bot: k8s: add perl5.32 type [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/785374 (https://phabricator.wikimedia.org/T214343) (owner: 10BryanDavis) [17:46:00] (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:47:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - 
https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [17:51:18] (03PS1) 10BryanDavis: d/changelog: Prepare for 0.82 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/785375 (https://phabricator.wikimedia.org/T214343) [17:58:07] (03CR) 10BryanDavis: [C: 03+2] d/changelog: Prepare for 0.82 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/785375 (https://phabricator.wikimedia.org/T214343) (owner: 10BryanDavis) [17:59:56] (03Merged) 10jenkins-bot: d/changelog: Prepare for 0.82 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/785375 (https://phabricator.wikimedia.org/T214343) (owner: 10BryanDavis) [18:02:10] RECOVERY - PHP7 jobrunner on mw1338 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [18:02:38] RECOVERY - PHP7 rendering on mw1338 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [18:27:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T306560)', diff saved to https://phabricator.wikimedia.org/P26311 and previous config saved to /var/cache/conftool/dbconfig/20220423-182701-ladsgroup.json [18:27:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:08] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [18:32:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:42:06] !log ladsgroup@cumin1001 dbctl 
commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P26312 and previous config saved to /var/cache/conftool/dbconfig/20220423-184206-ladsgroup.json [18:42:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P26313 and previous config saved to /var/cache/conftool/dbconfig/20220423-185711-ladsgroup.json [18:57:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:12:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T306560)', diff saved to https://phabricator.wikimedia.org/P26314 and previous config saved to /var/cache/conftool/dbconfig/20220423-191216-ladsgroup.json [19:12:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [19:12:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [19:12:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:21] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [19:12:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3314 (T306560)', diff saved to https://phabricator.wikimedia.org/P26315 and previous config saved to /var/cache/conftool/dbconfig/20220423-191224-ladsgroup.json [19:12:26] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:22] RECOVERY - PHP7 jobrunner on mw1308 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.024 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [19:19:12] RECOVERY - PHP7 rendering on mw1308 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [19:59:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestagemaster1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:17:08] (03PS2) 10Krinkle: Stop writing to $wmfSwiftConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/781058 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [20:18:08] RECOVERY - SSH on wtp1035.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:38:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T306560)', diff saved to https://phabricator.wikimedia.org/P26316 and previous config saved to /var/cache/conftool/dbconfig/20220423-203808-ladsgroup.json [20:38:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:15] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [20:47:10] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:53:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P26317 and previous config saved to 
/var/cache/conftool/dbconfig/20220423-205313-ladsgroup.json [20:53:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:44] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:08:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P26318 and previous config saved to /var/cache/conftool/dbconfig/20220423-210819-ladsgroup.json [21:08:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:34] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:21:10] PROBLEM - SSH on wtp1046.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:23:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T306560)', diff saved to https://phabricator.wikimedia.org/P26319 and previous config saved to /var/cache/conftool/dbconfig/20220423-212324-ladsgroup.json [21:23:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1141.eqiad.wmnet with reason: Maintenance [21:23:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1141.eqiad.wmnet with reason: Maintenance [21:23:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:29] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [21:23:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1141 (T306560)', 
diff saved to https://phabricator.wikimedia.org/P26320 and previous config saved to /var/cache/conftool/dbconfig/20220423-212332-ladsgroup.json [21:23:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:00] (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:47:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [21:50:46] RECOVERY - PHP7 rendering on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:51:36] RECOVERY - PHP7 jobrunner on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [22:22:18] RECOVERY - SSH on wtp1046.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:32:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:42:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T306560)', diff saved to https://phabricator.wikimedia.org/P26321 and previous config saved to /var/cache/conftool/dbconfig/20220423-224220-ladsgroup.json 
[22:42:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:28] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [22:43:16] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:57:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P26322 and previous config saved to /var/cache/conftool/dbconfig/20220423-225725-ladsgroup.json [22:57:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [23:10:34] RECOVERY - PHP7 jobrunner on mw1445 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [23:10:58] RECOVERY - PHP7 rendering on mw1445 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.024 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [23:11:58] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:12:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P26323 and previous config saved to /var/cache/conftool/dbconfig/20220423-231230-ladsgroup.json [23:12:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:20:52] RECOVERY - SSH on furud.mgmt is OK: SSH OK - OpenSSH_7.0 
(protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:27:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T306560)', diff saved to https://phabricator.wikimedia.org/P26324 and previous config saved to /var/cache/conftool/dbconfig/20220423-232735-ladsgroup.json [23:27:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1121.eqiad.wmnet with reason: Maintenance [23:27:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1121.eqiad.wmnet with reason: Maintenance [23:27:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:27:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [23:27:41] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [23:27:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:27:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [23:27:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:27:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:27:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1121 (T306560)', diff saved to https://phabricator.wikimedia.org/P26325 and previous config saved to /var/cache/conftool/dbconfig/20220423-232748-ladsgroup.json [23:27:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:27:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:37:00] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:59:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestagemaster1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown