[00:00:30] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[00:00:58] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:02:46] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 4.367 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[00:03:18] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:06:00] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:07:33] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:07:58] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 9.541 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:08:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P26234 and previous config saved to /var/cache/conftool/dbconfig/20220423-000839-ladsgroup.json
[00:08:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:10:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T298565)', diff saved to https://phabricator.wikimedia.org/P26235 and previous config saved to /var/cache/conftool/dbconfig/20220423-001058-ladsgroup.json
[00:11:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:11:03] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[00:16:46] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[00:19:42] <icinga-wm>	 RECOVERY - Check systemd state on doc1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:21:18] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 5.099 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[00:23:18] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:23:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P26236 and previous config saved to /var/cache/conftool/dbconfig/20220423-002344-ladsgroup.json
[00:23:46] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance
[00:23:47] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance
[00:23:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:23:49] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[00:23:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:23:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P26237 and previous config saved to /var/cache/conftool/dbconfig/20220423-002352-ladsgroup.json
[00:23:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:23:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:26:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P26238 and previous config saved to /var/cache/conftool/dbconfig/20220423-002603-ladsgroup.json
[00:26:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:26:52] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:27:33] <jinxer-wm>	 (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:29:18] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:29:26] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:31:18] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:31:38] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 2.647 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:35:58] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[00:37:06] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:37:22] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:38:06] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[00:39:14] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.027 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:39:18] <jinxer-wm>	 (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:39:40] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 7.892 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:41:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P26239 and previous config saved to /var/cache/conftool/dbconfig/20220423-004108-ladsgroup.json
[00:41:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:42:00] <icinga-wm>	 PROBLEM - Check systemd state on mw1438 is CRITICAL: CRITICAL - degraded: The following units failed: php7.2-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:42:18] <jinxer-wm>	 (ProbeDown) firing: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:42:50] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:44:58] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.099 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:46:11] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[00:46:12] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[00:46:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:46:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:46:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T306560)', diff saved to https://phabricator.wikimedia.org/P26240 and previous config saved to /var/cache/conftool/dbconfig/20220423-004617-ladsgroup.json
[00:46:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:46:22] <stashbot>	 T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560
[00:47:18] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:47:18] <jinxer-wm>	 (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:47:28] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[00:49:08] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[00:49:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T306560)', diff saved to https://phabricator.wikimedia.org/P26241 and previous config saved to /var/cache/conftool/dbconfig/20220423-004935-ladsgroup.json
[00:49:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:51:18] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:52:04] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 5.787 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[00:56:02] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[00:56:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T298565)', diff saved to https://phabricator.wikimedia.org/P26242 and previous config saved to /var/cache/conftool/dbconfig/20220423-005613-ladsgroup.json
[00:56:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:56:18] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance
[00:56:18] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:56:19] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[00:56:20] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance
[00:56:21] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 10 hosts with reason: Maintenance
[00:56:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:56:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:56:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:56:28] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 10 hosts with reason: Maintenance
[00:56:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:56:52] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[00:59:37] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[00:59:38] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[00:59:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:59:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:01:18] <jinxer-wm>	 (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:01:48] <jinxer-wm>	 (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:02:44] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[01:03:26] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.028 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[01:03:38] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.024 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[01:04:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P26243 and previous config saved to /var/cache/conftool/dbconfig/20220423-010440-ladsgroup.json
[01:04:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:05:42] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[01:06:33] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:06:48] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:07:58] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 7.102 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[01:10:06] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[01:11:33] <jinxer-wm>	 (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:12:30] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[01:13:18] <jinxer-wm>	 (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:14:32] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.027 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[01:14:42] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[01:16:58] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[01:16:58] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 6.207 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[01:18:18] <jinxer-wm>	 (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:19:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P26244 and previous config saved to /var/cache/conftool/dbconfig/20220423-011945-ladsgroup.json
[01:19:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:24:18] <jinxer-wm>	 (ProbeDown) firing: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:27:18] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[01:29:18] <jinxer-wm>	 (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:29:26] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[01:30:18] <jinxer-wm>	 (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:30:32] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[01:32:56] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[01:33:28] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[01:33:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P26245 and previous config saved to /var/cache/conftool/dbconfig/20220423-013336-ladsgroup.json
[01:33:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:33:44] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[01:34:33] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:34:41] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[01:34:43] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[01:34:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:34:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:34:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T306560)', diff saved to https://phabricator.wikimedia.org/P26246 and previous config saved to /var/cache/conftool/dbconfig/20220423-013450-ladsgroup.json
[01:34:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:34:54] <stashbot>	 T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560
[01:35:02] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[01:35:08] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 2.112 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[01:39:33] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:39:56] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[01:40:45] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:42:16] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 9.820 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[01:44:33] <jinxer-wm>	 (ProbeDown) resolved: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:45:12] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 22.4 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[01:45:12] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[01:45:18] <jinxer-wm>	 (ProbeDown) firing: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:45:24] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 14.26 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[01:45:36] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 13.11 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[01:45:45] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:46:52] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[01:47:22] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[01:47:24] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 1.985 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[01:47:55] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[01:48:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P26247 and previous config saved to /var/cache/conftool/dbconfig/20220423-014841-ladsgroup.json
[01:48:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:48:46] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[01:49:00] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.025 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[01:49:33] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:49:48] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 100.9 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[01:50:00] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[01:50:12] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[01:51:24] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[01:53:16] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.968 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[01:53:30] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[01:53:36] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 2.230 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[01:53:46] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[01:55:56] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 2.437 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[01:56:30] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 6.236 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[01:58:20] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[01:59:33] <jinxer-wm>	 (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:00:02] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[02:00:18] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:00:32] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 3.359 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[02:00:40] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[02:01:12] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[02:03:22] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[02:03:34] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[02:03:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P26248 and previous config saved to /var/cache/conftool/dbconfig/20220423-020346-ladsgroup.json
[02:03:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:05:10] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.030 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[02:05:18] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[02:07:00] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 9.457 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[02:07:34] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 6.642 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[02:08:04] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[02:08:10] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[02:09:34] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.029 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[02:12:05] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance
[02:12:06] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance
[02:12:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:12:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:12:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P26249 and previous config saved to /var/cache/conftool/dbconfig/20220423-021211-ladsgroup.json
[02:12:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:12:16] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[02:12:16] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[02:12:40] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[02:14:28] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 1.143 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[02:14:33] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:16:10] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[02:18:18] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.032 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[02:18:20] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1149.eqiad.wmnet with reason: Maintenance
[02:18:21] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1149.eqiad.wmnet with reason: Maintenance
[02:18:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:18:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:18:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1149 (T306560)', diff saved to https://phabricator.wikimedia.org/P26250 and previous config saved to /var/cache/conftool/dbconfig/20220423-021826-ladsgroup.json
[02:18:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:18:30] <stashbot>	 T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560
[02:18:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P26251 and previous config saved to /var/cache/conftool/dbconfig/20220423-021851-ladsgroup.json
[02:18:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:18:57] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[02:19:33] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:20:18] <jinxer-wm>	 (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:21:48] <jinxer-wm>	 (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:23:52] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[02:24:22] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[02:24:33] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:25:36] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[02:26:34] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 1.474 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[02:26:48] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:27:46] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[02:28:30] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[02:29:33] <jinxer-wm>	 (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:29:52] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[02:30:40] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.135 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[02:30:42] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.462 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[02:32:55] <jinxer-wm>	 (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[02:34:26] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 2.422 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[02:35:32] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[02:36:18] <jinxer-wm>	 (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:37:40] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[02:40:08] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[02:40:10] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[02:41:18] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:42:28] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[02:43:46] <icinga-wm>	 PROBLEM - LVS jobrunner eqiad port 443/tcp - JobRunner LVS interface -https-. jobrunner.svc.eqiad.wmnet IPv4 #page on jobrunner.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[02:44:35] <rzl>	 hey, looking
[02:44:40] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[02:44:46] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 8.518 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[02:45:40] <jhathaway>	 here as well
[02:45:56] <icinga-wm>	 RECOVERY - LVS jobrunner eqiad port 443/tcp - JobRunner LVS interface -https-. jobrunner.svc.eqiad.wmnet IPv4 #page on jobrunner.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 400 bytes in 1.234 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[02:46:12] <rzl>	 I think this is requeueing from https://phabricator.wikimedia.org/T306697 but I don't have a ton of context, trying to catch up
[02:46:18] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:46:31] <jhathaway>	 rzl: nod, looking at that task
[02:47:11] <Amir1>	 Around but it recovered?
[02:47:46] <rzl>	 it did but it's been flapping for long enough that I want to make sure it's stable
[02:49:01] <rzl>	 ffmpeg is still maxing CPU, unsurprisingly
[02:49:08] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[02:49:10] <Amir1>	 Sigh
[02:49:26] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[02:49:34] <rzl>	 the job queue should still be okay though, since we have some jobrunner hosts depooled from videoscaling so other jobs aren't getting starved out
[02:50:19] <Amir1>	 Yeah. The only thing we can do is to wait for it to stabilize I guess
[02:50:58] <rzl>	 yeah, I wish we didn't fail healthchecks in that state though :/ wonder if there's already a task for it
[02:51:14] <jhathaway>	 do we have a way to view the queue depth?
[02:51:24] <rzl>	 nice if we could leave one core free, or something
[02:51:41] <Amir1>	 There is a grafana dashboard 
[02:51:42] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 7.463 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[02:52:20] <jhathaway>	 rzl: yeah, leaving one CPU pinned to other tasks would be a good idea
[02:52:59] <Amir1>	 How can you tell the OS to do that?
[02:53:16] <Amir1>	 I think making it nicer would help
[02:53:52] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 1.643 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[02:54:02] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 7.298 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[02:54:55] <Amir1>	 I'll go rest though. 5am here 👋🤦
[02:55:10] <jhathaway>	 Amir1: night night, thanks
[02:55:20] <Amir1>	 Anything needed can wait until Monday 
[02:56:18] <jinxer-wm>	 (ProbeDown) resolved: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:56:21] <rzl>	 yeah I'm pretty sure there's nothing we need to do here -- I just wish I were more confident the same alert wasn't going to fire again while we churn through the backlog
[02:58:01] <jhathaway>	 I don't have much context yet on the internals, but it seems probable that it will fire again
[02:58:18] <jinxer-wm>	 (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:59:04] <rzl>	 yeah I mean the IRC alert will definitely keep flapping, it's just a matter of whether we're lucky enough it doesn't page
[02:59:38] <Amir1>	 Can we disable paging until Monday? 
[02:59:40] <icinga-wm>	 PROBLEM - LVS videoscaler eqiad port 443/tcp - Videoscaler LVS interface -https-. videoscaler.svc.eqiad.wmnet IPv4 #page on videoscaler.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[02:59:47] <rzl>	 ^ haha
[02:59:58] <rzl>	 unfortunately not without leaving ourselves blind to a real problem
[03:00:06] <Amir1>	 Case in point ^
[03:00:16] <Amir1>	 Sigh
[03:00:20] <Amir1>	 OK then 
[03:00:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P26252 and previous config saved to /var/cache/conftool/dbconfig/20220423-030035-ladsgroup.json
[03:00:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:00:42] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[03:00:46] <rzl>	 trying to figure out if there's a good mitigation here
[03:01:25] <Amir1>	 Maybe pool more mw hosts for video scalers?
[03:01:55] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[03:02:00] <Amir1>	 (Sorry can't sleep 😕)
[03:02:04] <rzl>	 so we get out of this state faster, you mean? yeah, although then I'd worry about making progress on other jobs
[03:02:18] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[03:02:52] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[03:03:34] <Amir1>	 If it's possible to take some out from the appservers, that'd be nice
[03:04:16] <icinga-wm>	 RECOVERY - LVS videoscaler eqiad port 443/tcp - Videoscaler LVS interface -https-. videoscaler.svc.eqiad.wmnet IPv4 #page on videoscaler.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 402 bytes in 6.526 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[03:04:28] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.032 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[03:04:35] <rzl>	 hm, conceivably? we'd need to depool them from normal traffic, I wouldn't want them to contend with ffmpeg for CPUs
[03:05:01] <rzl>	 they're different puppet roles and everything though, I'm not sure if appservers even have ffmpeg installed
[03:05:13] <Amir1>	 Ugh. OK then
[03:05:16] <rzl>	 we'd have to reimage and everything, I don't really want to do that mid-incident
[03:05:24] <rzl>	 (or mid-Friday-evening if I'm honest)
[03:05:52] <rzl>	 it's a good thought though
[03:06:04] <Amir1>	 Yeah. I thought they were similar
[03:06:43] <rzl>	 I'm also not sure if it would make a big enough dent - I think we're about 11 hours from getting through the video backlog at our current rate, so we'd need to commandeer a lot of machines if we wanted to get it done before EU morning for example
[03:07:41] <Amir1>	  (https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-job=webVideoTranscode)
[03:07:57] <rzl>	 oh sorry yeah, thought I linked that earlier but I got distracted mid-thought
[03:08:13] <Amir1>	 10 hours of backlog 
[03:08:18] <jinxer-wm>	 (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:08:30] <rzl>	 I'm also second-guessing myself about downtiming just the LVS alert for videoscaler.svc.eqiad.wmnet, so it doesn't page
[03:09:00] <Amir1>	 Try it! Try it!
[03:09:18] <jinxer-wm>	 (ProbeDown) firing: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:09:25] <rzl>	 it's definitely not ideal, but, if it fires again in 11 hours we can take a look -- and we know those machines are going to be in bad shape until then
[03:09:37] <rzl>	 jhathaway: if you're still around, any opinions?
[03:09:57] <jhathaway>	 I think that makes sense as a short-term bandaid, and we can look to pin a CPU in the longer term
[03:10:57] <rzl>	 sounds good, downtiming until 14 UTC -- I'll leave a note in the other channel as well, in case of questions
[03:11:09] <jhathaway>	 thanks
[03:11:15] <Amir1>	 Thanks rzl
[03:12:08] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 9.298 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[03:13:33] <jinxer-wm>	 (ProbeDown) resolved: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:15:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P26253 and previous config saved to /var/cache/conftool/dbconfig/20220423-031540-ladsgroup.json
[03:15:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:16:08] <rzl>	 done 👍 thanks Amir1 and jhathaway <3 I'm checking back out, hope you can get some sleep Amir
[03:16:18] <jinxer-wm>	 (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:16:51] <Amir1>	 😊 
[03:21:18] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:24:10] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[03:26:26] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 5.844 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[03:30:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P26254 and previous config saved to /var/cache/conftool/dbconfig/20220423-033045-ladsgroup.json
[03:30:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:31:18] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:34:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T306560)', diff saved to https://phabricator.wikimedia.org/P26255 and previous config saved to /var/cache/conftool/dbconfig/20220423-033438-ladsgroup.json
[03:34:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:34:43] <stashbot>	 T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560
[03:36:18] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:38:02] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[03:39:42] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[03:40:18] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 7.091 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[03:41:18] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:41:52] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.020 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[03:45:28] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[03:45:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P26256 and previous config saved to /var/cache/conftool/dbconfig/20220423-034550-ladsgroup.json
[03:45:52] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[03:45:53] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[03:45:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:45:55] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[03:45:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:45:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P26257 and previous config saved to /var/cache/conftool/dbconfig/20220423-034558-ladsgroup.json
[03:45:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:46:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:47:42] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 3.325 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[03:49:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P26258 and previous config saved to /var/cache/conftool/dbconfig/20220423-034943-ladsgroup.json
[03:49:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:51:18] <jinxer-wm>	 (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:52:18] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:53:54] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[03:54:36] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[03:54:40] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[03:56:02] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.443 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[03:56:33] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:56:50] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.959 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[03:57:18] <jinxer-wm>	 (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:59:10] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 3.629 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[03:59:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on kubestagemaster1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[04:00:18] <jinxer-wm>	 (ProbeDown) firing: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:01:20] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[04:01:33] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:03:40] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 8.824 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[04:04:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P26259 and previous config saved to /var/cache/conftool/dbconfig/20220423-040448-ladsgroup.json
[04:04:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:05:18] <jinxer-wm>	 (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:06:48] <jinxer-wm>	 (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:17:32] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[04:19:42] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.365 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[04:19:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T306560)', diff saved to https://phabricator.wikimedia.org/P26260 and previous config saved to /var/cache/conftool/dbconfig/20220423-041953-ladsgroup.json
[04:19:55] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1148.eqiad.wmnet with reason: Maintenance
[04:19:56] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[04:19:57] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1148.eqiad.wmnet with reason: Maintenance
[04:19:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:20:00] <stashbot>	 T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560
[04:20:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:20:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1148 (T306560)', diff saved to https://phabricator.wikimedia.org/P26261 and previous config saved to /var/cache/conftool/dbconfig/20220423-042001-ladsgroup.json
[04:20:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:20:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:21:33] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:22:12] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 5.877 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[04:25:26] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[04:26:33] <jinxer-wm>	 (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:27:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P26262 and previous config saved to /var/cache/conftool/dbconfig/20220423-042704-ladsgroup.json
[04:27:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:27:10] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[04:30:00] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 4.071 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[04:31:33] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:31:54] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[04:33:02] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[04:35:14] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 1.804 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[04:36:14] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[04:36:26] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.025 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[04:36:33] <jinxer-wm>	 (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:40:18] <jinxer-wm>	 (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:42:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P26263 and previous config saved to /var/cache/conftool/dbconfig/20220423-044209-ladsgroup.json
[04:42:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:43:40] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[04:45:18] <jinxer-wm>	 (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:45:54] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[04:47:18] <jinxer-wm>	 (ProbeDown) firing: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:48:02] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[04:49:44] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[04:50:18] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 4.077 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[04:50:24] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.031 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[04:51:46] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[04:51:54] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[04:52:18] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:55:12] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[04:55:30] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[04:56:42] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[04:57:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P26264 and previous config saved to /var/cache/conftool/dbconfig/20220423-045714-ladsgroup.json
[04:57:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:57:18] <jinxer-wm>	 (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:57:38] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.792 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[05:01:18] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[05:01:40] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[05:02:44] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[05:03:30] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.026 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[05:04:18] <jinxer-wm>	 (ProbeDown) firing: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:06:28] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[05:06:50] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[05:09:18] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[05:09:18] <jinxer-wm>	 (ProbeDown) resolved: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:09:30] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[05:11:18] <jinxer-wm>	 (ProbeDown) firing: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:11:44] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 4.557 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[05:12:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P26265 and previous config saved to /var/cache/conftool/dbconfig/20220423-051219-ladsgroup.json
[05:12:21] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[05:12:22] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[05:12:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:12:25] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[05:12:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:12:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:13:56] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 9.663 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[05:15:02] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[05:15:18] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[05:15:36] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.028 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[05:16:18] <jinxer-wm>	 (ProbeDown) resolved: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:16:34] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[05:17:22] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[05:18:44] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.839 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[05:20:40] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[05:22:18] <jinxer-wm>	 (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:22:48] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 4.841 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[05:23:04] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[05:25:28] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 8.256 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[05:26:54] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[05:27:18] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:28:12] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[05:28:38] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[05:28:52] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1440 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[05:30:24] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 2.242 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[05:31:10] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1440 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 6.811 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[05:32:06] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[05:32:08] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[05:32:18] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:32:42] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[05:34:28] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[05:35:56] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1440 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[05:36:52] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[05:38:00] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[05:38:42] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1440 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[05:39:04] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.029 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[05:39:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T306560)', diff saved to https://phabricator.wikimedia.org/P26266 and previous config saved to /var/cache/conftool/dbconfig/20220423-053940-ladsgroup.json
[05:39:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:39:45] <stashbot>	 T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560
[05:40:38] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - videoscaler_443: Servers mw1338.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[05:41:22] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[05:41:24] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[05:42:18] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:43:00] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[05:44:23] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 5.245 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[05:45:18] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1440 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 7.940 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[05:46:00] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:46:16] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - videoscaler_443: Servers mw1338.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[05:47:14] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[05:47:55] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[05:49:10] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[05:49:36] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[05:50:04] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1440 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[05:50:14] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[05:50:56] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[05:51:12] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[05:51:13] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[05:51:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:51:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:51:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T298565)', diff saved to https://phabricator.wikimedia.org/P26267 and previous config saved to /var/cache/conftool/dbconfig/20220423-055118-ladsgroup.json
[05:51:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:51:22] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[05:52:58] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[05:53:34] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[05:54:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P26268 and previous config saved to /var/cache/conftool/dbconfig/20220423-055445-ladsgroup.json
[05:54:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:55:04] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[05:55:26] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[05:55:36] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[05:57:18] <jinxer-wm>	 (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:57:48] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:58:06] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 4.316 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[06:02:33] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:02:34] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - videoscaler_443: Servers mw1338.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[06:02:48] <jinxer-wm>	 (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:02:52] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[06:03:58] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - videoscaler_443: Servers mw1438.eqiad.wmnet, mw1338.eqiad.wmnet, mw1308.eqiad.wmnet, mw1445.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[06:04:18] <jinxer-wm>	 (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:04:56] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[06:05:50] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1338 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[06:06:32] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[06:06:54] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[06:08:04] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1338 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 4.368 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[06:08:32] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1308 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[06:09:18] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:09:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P26269 and previous config saved to /var/cache/conftool/dbconfig/20220423-060950-ladsgroup.json
[06:09:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:12:33] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:14:08] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1308 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[06:17:32] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1338 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[06:18:28] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[06:18:40] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1338 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[06:19:18] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:19:48] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1338 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 5.094 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[06:19:48] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[06:20:48] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1338 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[06:22:12] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.027 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[06:22:33] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:23:22] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1445 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[06:23:32] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[06:24:36] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1338 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[06:24:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T306560)', diff saved to https://phabricator.wikimedia.org/P26270 and previous config saved to /var/cache/conftool/dbconfig/20220423-062455-ladsgroup.json
[06:24:57] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1147.eqiad.wmnet with reason: Maintenance
[06:24:58] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1147.eqiad.wmnet with reason: Maintenance
[06:24:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:25:01] <stashbot>	 T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560
[06:25:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:25:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1147 (T306560)', diff saved to https://phabricator.wikimedia.org/P26271 and previous config saved to /var/cache/conftool/dbconfig/20220423-062503-ladsgroup.json
[06:25:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:25:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:25:34] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1445 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[06:26:46] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1338 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[06:27:58] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[06:28:50] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.024 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[06:31:14] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1338 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[06:31:16] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1445 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[06:32:33] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:32:55] <jinxer-wm>	 (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[06:35:10] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1445 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[06:35:28] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1338 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[06:36:18] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1440 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[06:36:36] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[06:36:42] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1440 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 1.905 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[06:37:08] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[06:37:33] <jinxer-wm>	 (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:38:00] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[06:39:18] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:40:18] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1338 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[06:42:18] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 6.083 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[06:43:04] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1440 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[06:43:44] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - videoscaler_443: Servers mw1446.eqiad.wmnet, mw1440.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[06:43:52] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1445 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.018 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[06:44:18] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:45:32] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1446 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[06:47:40] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1440 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[06:48:24] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1445 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[06:48:48] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[06:50:00] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[06:50:28] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[06:50:58] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 6.931 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[06:51:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T298565)', diff saved to https://phabricator.wikimedia.org/P26272 and previous config saved to /var/cache/conftool/dbconfig/20220423-065133-ladsgroup.json
[06:51:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:39] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[06:51:48] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1338 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[06:53:32] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[06:54:22] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.027 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[06:55:38] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[06:57:48] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 4.586 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[07:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220423T0700)
[07:01:55] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[07:04:30] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[07:05:26] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[07:06:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P26273 and previous config saved to /var/cache/conftool/dbconfig/20220423-070638-ladsgroup.json
[07:06:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:06:44] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 8.878 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[07:07:30] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.026 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[07:13:06] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[07:15:12] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.024 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[07:17:56] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[07:20:02] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.018 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[07:21:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P26274 and previous config saved to /var/cache/conftool/dbconfig/20220423-072143-ladsgroup.json
[07:21:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:02] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[07:24:42] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[07:36:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T298565)', diff saved to https://phabricator.wikimedia.org/P26275 and previous config saved to /var/cache/conftool/dbconfig/20220423-073648-ladsgroup.json
[07:36:50] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[07:36:51] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[07:36:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:53] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[07:36:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T298565)', diff saved to https://phabricator.wikimedia.org/P26276 and previous config saved to /var/cache/conftool/dbconfig/20220423-073656-ladsgroup.json
[07:36:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T306560)', diff saved to https://phabricator.wikimedia.org/P26277 and previous config saved to /var/cache/conftool/dbconfig/20220423-074211-ladsgroup.json
[07:42:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:17] <stashbot>	 T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560
[07:50:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T298565)', diff saved to https://phabricator.wikimedia.org/P26278 and previous config saved to /var/cache/conftool/dbconfig/20220423-075017-ladsgroup.json
[07:50:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:21] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[07:57:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P26279 and previous config saved to /var/cache/conftool/dbconfig/20220423-075716-ladsgroup.json
[07:57:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on kubestagemaster1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[08:03:34] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[08:04:10] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.024 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[08:05:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P26280 and previous config saved to /var/cache/conftool/dbconfig/20220423-080522-ladsgroup.json
[08:05:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:12:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P26281 and previous config saved to /var/cache/conftool/dbconfig/20220423-081221-ladsgroup.json
[08:12:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:08] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.024 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[08:14:30] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.025 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[08:20:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P26282 and previous config saved to /var/cache/conftool/dbconfig/20220423-082027-ladsgroup.json
[08:20:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:33] <jinxer-wm>	 (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:24:18] <jinxer-wm>	 (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:27:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T306560)', diff saved to https://phabricator.wikimedia.org/P26283 and previous config saved to /var/cache/conftool/dbconfig/20220423-082726-ladsgroup.json
[08:27:28] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1143.eqiad.wmnet with reason: Maintenance
[08:27:30] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1143.eqiad.wmnet with reason: Maintenance
[08:27:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:32] <stashbot>	 T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560
[08:27:33] <jinxer-wm>	 (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:27:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1143 (T306560)', diff saved to https://phabricator.wikimedia.org/P26284 and previous config saved to /var/cache/conftool/dbconfig/20220423-082735-ladsgroup.json
[08:27:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:03] <jinxer-wm>	 (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:35:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T298565)', diff saved to https://phabricator.wikimedia.org/P26285 and previous config saved to /var/cache/conftool/dbconfig/20220423-083532-ladsgroup.json
[08:35:34] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[08:35:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:36] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[08:35:37] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[08:35:37] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[08:35:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:40] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[08:35:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T298565)', diff saved to https://phabricator.wikimedia.org/P26286 and previous config saved to /var/cache/conftool/dbconfig/20220423-083545-ladsgroup.json
[08:35:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T298565)', diff saved to https://phabricator.wikimedia.org/P26287 and previous config saved to /var/cache/conftool/dbconfig/20220423-084920-ladsgroup.json
[08:49:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:25] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[08:54:03] <jinxer-wm>	 (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:55:18] <jinxer-wm>	 (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:02:26] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[09:02:44] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[09:04:10] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[09:04:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P26288 and previous config saved to /var/cache/conftool/dbconfig/20220423-090425-ladsgroup.json
[09:04:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:32] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 1.055 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[09:07:00] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[09:19:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P26289 and previous config saved to /var/cache/conftool/dbconfig/20220423-091930-ladsgroup.json
[09:19:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T298565)', diff saved to https://phabricator.wikimedia.org/P26290 and previous config saved to /var/cache/conftool/dbconfig/20220423-093435-ladsgroup.json
[09:34:37] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[09:34:39] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[09:34:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:41] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[09:34:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P26291 and previous config saved to /var/cache/conftool/dbconfig/20220423-093443-ladsgroup.json
[09:34:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:56] <elukey>	 !log `apt-get clean` on an-airflow1001 to free some space
[09:38:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:00] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:46:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T306560)', diff saved to https://phabricator.wikimedia.org/P26292 and previous config saved to /var/cache/conftool/dbconfig/20220423-094610-ladsgroup.json
[09:46:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:15] <stashbot>	 T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560
[09:47:55] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[10:01:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P26293 and previous config saved to /var/cache/conftool/dbconfig/20220423-100115-ladsgroup.json
[10:01:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:16:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P26294 and previous config saved to /var/cache/conftool/dbconfig/20220423-101622-ladsgroup.json
[10:16:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P26295 and previous config saved to /var/cache/conftool/dbconfig/20220423-101955-ladsgroup.json
[10:19:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:20:00] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[10:31:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T306560)', diff saved to https://phabricator.wikimedia.org/P26296 and previous config saved to /var/cache/conftool/dbconfig/20220423-103127-ladsgroup.json
[10:31:28] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1142.eqiad.wmnet with reason: Maintenance
[10:31:30] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1142.eqiad.wmnet with reason: Maintenance
[10:31:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:31:32] <stashbot>	 T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560
[10:31:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:31:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1142 (T306560)', diff saved to https://phabricator.wikimedia.org/P26297 and previous config saved to /var/cache/conftool/dbconfig/20220423-103135-ladsgroup.json
[10:31:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:31:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:55] <jinxer-wm>	 (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[10:35:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P26298 and previous config saved to /var/cache/conftool/dbconfig/20220423-103500-ladsgroup.json
[10:35:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:50:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P26299 and previous config saved to /var/cache/conftool/dbconfig/20220423-105005-ladsgroup.json
[10:50:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:52:40] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[10:53:18] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.028 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:00:16] <icinga-wm>	 PROBLEM - Maps - OSM synchronization lag - eqiad on alert1001 is CRITICAL: 2.592e+05 ge 2.592e+05 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&viewPanel=11
[11:01:55] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[11:05:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T298565)', diff saved to https://phabricator.wikimedia.org/P26300 and previous config saved to /var/cache/conftool/dbconfig/20220423-110511-ladsgroup.json
[11:05:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:05:17] <stashbot>	 T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[11:05:18] <jinxer-wm>	 (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:06:18] <jinxer-wm>	 (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:11:52] <icinga-wm>	 PROBLEM - Disk space on ml-staging-ctrl2002 is CRITICAL: DISK CRITICAL - free space: / 1115 MB (5% inode=95%): /tmp 1115 MB (5% inode=95%): /var/tmp 1115 MB (5% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-staging-ctrl2002&var-datasource=codfw+prometheus/ops
[11:24:44] <icinga-wm>	 PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:25:10] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:25:33] <jinxer-wm>	 (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:26:18] <jinxer-wm>	 (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:29:12] <icinga-wm>	 PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:29:40] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:30:33] <jinxer-wm>	 (ProbeDown) firing: (3) Service upload-https:443 has failed probes (http_upload-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:31:16] <icinga-wm>	 PROBLEM - At least one CPU core of an LVS is saturated- packet drops are likely on lvs5002 is CRITICAL: cpu={0,10,12,14,2,4,6,8} https://bit.ly/wmf-lvscpu https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs5002&var-datasource=eqsin+prometheus/ops
[11:31:18] <jinxer-wm>	 (ProbeDown) firing: (3) Service upload-https:443 has failed probes (http_upload-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:32:55] <jinxer-wm>	 (Primary outbound port utilisation over 80%  #page) firing: Alert for device cr2-eqsin.wikimedia.org - Primary outbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[11:35:33] <jinxer-wm>	 (ProbeDown) resolved: (3) Service upload-https:443 has failed probes (http_upload-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:35:42] <icinga-wm>	 RECOVERY - At least one CPU core of an LVS is saturated- packet drops are likely on lvs5002 is OK: All metrics within thresholds. https://bit.ly/wmf-lvscpu https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs5002&var-datasource=eqsin+prometheus/ops
[11:36:18] <jinxer-wm>	 (ProbeDown) firing: (3) Service upload-https:443 has failed probes (http_upload-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:42:55] <jinxer-wm>	 (Primary outbound port utilisation over 80%  #page) resolved: Device cr2-eqsin.wikimedia.org recovered from Primary outbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[11:50:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T306560)', diff saved to https://phabricator.wikimedia.org/P26301 and previous config saved to /var/cache/conftool/dbconfig/20220423-115035-ladsgroup.json
[11:50:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:50:41] <stashbot>	 T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560
[11:51:56] <icinga-wm>	 PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:54:16] <icinga-wm>	 PROBLEM - Disk space on ml-staging-ctrl2002 is CRITICAL: DISK CRITICAL - free space: / 1115 MB (5% inode=95%): /tmp 1115 MB (5% inode=95%): /var/tmp 1115 MB (5% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-staging-ctrl2002&var-datasource=codfw+prometheus/ops
[11:55:42] <icinga-wm>	 PROBLEM - Check systemd state on mw1338 is CRITICAL: CRITICAL - degraded: The following units failed: php7.2-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:59:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on kubestagemaster1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[12:01:02] <icinga-wm>	 PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:05:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P26302 and previous config saved to /var/cache/conftool/dbconfig/20220423-120540-ladsgroup.json
[12:05:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:16:18] <jinxer-wm>	 (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:16:48] <jinxer-wm>	 (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:20:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P26303 and previous config saved to /var/cache/conftool/dbconfig/20220423-122045-ladsgroup.json
[12:20:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:21:48] <jinxer-wm>	 (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:22:48] <jinxer-wm>	 (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:35:28] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[12:35:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T306560)', diff saved to https://phabricator.wikimedia.org/P26304 and previous config saved to /var/cache/conftool/dbconfig/20220423-123550-ladsgroup.json
[12:35:52] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1138.eqiad.wmnet with reason: Maintenance
[12:35:54] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1138.eqiad.wmnet with reason: Maintenance
[12:35:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:35:56] <stashbot>	 T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560
[12:35:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:35:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1138 (T306560)', diff saved to https://phabricator.wikimedia.org/P26305 and previous config saved to /var/cache/conftool/dbconfig/20220423-123558-ladsgroup.json
[12:36:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:36:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:36:56] <icinga-wm>	 PROBLEM - Check systemd state on mw1440 is CRITICAL: CRITICAL - degraded: The following units failed: php7.2-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:38:04] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[12:59:46] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:02:44] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[13:03:48] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[13:04:02] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.239 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:06:08] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 9.091 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[13:10:33] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:20:26] <icinga-wm>	 PROBLEM - Check systemd state on mw1445 is CRITICAL: CRITICAL - degraded: The following units failed: php7.2-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:31:20] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:35:40] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[13:36:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T306560)', diff saved to https://phabricator.wikimedia.org/P26306 and previous config saved to /var/cache/conftool/dbconfig/20220423-133614-ladsgroup.json
[13:36:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:36:21] <stashbot>	 T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560
[13:36:40] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:46:00] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:47:55] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[13:51:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P26307 and previous config saved to /var/cache/conftool/dbconfig/20220423-135119-ladsgroup.json
[13:51:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:03:22] <icinga-wm>	 RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:03:38] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[14:04:56] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.026 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[14:06:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P26308 and previous config saved to /var/cache/conftool/dbconfig/20220423-140624-ladsgroup.json
[14:06:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:21:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T306560)', diff saved to https://phabricator.wikimedia.org/P26309 and previous config saved to /var/cache/conftool/dbconfig/20220423-142129-ladsgroup.json
[14:21:31] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[14:21:33] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[14:21:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:21:36] <stashbot>	 T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560
[14:21:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:21:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:32:55] <jinxer-wm>	 (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[14:36:47] <icinga-wm>	 PROBLEM - LVS videoscaler eqiad port 443/tcp - Videoscaler LVS interface -https-. videoscaler.svc.eqiad.wmnet IPv4 #page on videoscaler.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[14:38:59] <icinga-wm>	 RECOVERY - LVS videoscaler eqiad port 443/tcp - Videoscaler LVS interface -https-. videoscaler.svc.eqiad.wmnet IPv4 #page on videoscaler.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 398 bytes in 1.071 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[14:43:55] <rzl>	 ^ strange, afaict from the jobqueue dashboard we did get through the videoscaling backlog as expected, but cpu is still pegged high
[14:44:26] <icinga-wm>	 PROBLEM - Check systemd state on mw1446 is CRITICAL: CRITICAL - degraded: The following units failed: php7.2-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:45:34] <rzl>	 this woke me up so I'm a little sluggish, may be missing something obvious :) still looking but I'll probably re-downtime for another few hours to let it clear
[14:46:11] <icinga-wm>	 PROBLEM - LVS videoscaler eqiad port 443/tcp - Videoscaler LVS interface -https-. videoscaler.svc.eqiad.wmnet IPv4 #page on videoscaler.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[14:48:13] <rzl>	 (would love a second opinion if anyone else is around)
[14:53:17] <rzl>	 done
[15:01:55] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[15:11:41] <icinga-wm>	 RECOVERY - LVS videoscaler eqiad port 443/tcp - Videoscaler LVS interface -https-. videoscaler.svc.eqiad.wmnet IPv4 #page on videoscaler.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 402 bytes in 3.700 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[15:20:14] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 3.982 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[15:24:56] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[15:39:53] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[15:39:55] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[15:39:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:39:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:59:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on kubestagemaster1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[16:16:12] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=no; selector: cluster=jobrunner,name=mw1338.eqiad.wmnet
[16:16:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:16:20] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=no; selector: cluster=jobrunner,name=mw1437.eqiad.wmnet
[16:16:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:16:27] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=no; selector: cluster=jobrunner,name=mw1438.eqiad.wmnet
[16:16:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:16:33] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=no; selector: cluster=jobrunner,name=mw1439.eqiad.wmnet
[16:16:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:16:43] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=no; selector: cluster=jobrunner,name=mw1440.eqiad.wmnet
[16:16:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:16:55] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=no; selector: cluster=jobrunner,name=mw1445.eqiad.wmnet
[16:16:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:17:03] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=no; selector: cluster=jobrunner,name=mw1446.eqiad.wmnet
[16:17:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:18:11] <akosiaris>	 !log depool the videoscalers from the jobrunner cluster. Effectively split the 2 clusters that way. This should isolate the rest of the jobs from the video transcoding jobs reducing the latency that they are experiencing
[16:18:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:22:50] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/weight=8; selector: cluster=jobrunner,name=mw1336.eqiad.wmnet
[16:22:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:22:57] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/weight=8; selector: cluster=jobrunner,name=mw1335.eqiad.wmnet
[16:22:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:24:19] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/weight=4; selector: cluster=jobrunner,name=mw1335.eqiad.wmnet
[16:24:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:24:24] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/weight=4; selector: cluster=jobrunner,name=mw1336.eqiad.wmnet
[16:24:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:25:15] <akosiaris>	 !log increase mw1335 and mw1336 weights on the jobrunner cluster from 1 to 4 (they were at 25% CPU usage). That should direct more traffic to them and lighten the load on the rest.
[16:25:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:25:58] <icinga-wm>	 PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE, AS6939/IPv6: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:30:18] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 92 probes of 675 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[16:36:22] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1440 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[16:36:38] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1440 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[16:36:40] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 60 probes of 675 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[16:42:45] <wikibugs>	 (CR) BryanDavis: [C: +2] Add perl532-sssd [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/778683 (https://phabricator.wikimedia.org/T214343) (owner: BryanDavis)
[16:43:21] <wikibugs>	 (Merged) jenkins-bot: Add perl532-sssd [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/778683 (https://phabricator.wikimedia.org/T214343) (owner: BryanDavis)
[16:44:03] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.024 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[16:44:26] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[16:45:33] <jinxer-wm>	 (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:47:33] <jinxer-wm>	 (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:47:48] <jinxer-wm>	 (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:52:18] <jinxer-wm>	 (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:55:50] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[16:57:14] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[16:57:18] <jinxer-wm>	 (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:59:32] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[16:59:34] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[16:59:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:59:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:59:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3314 (T306560)', diff saved to https://phabricator.wikimedia.org/P26310 and previous config saved to /var/cache/conftool/dbconfig/20220423-165939-ladsgroup.json
[16:59:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:59:44] <stashbot>	 T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560
[17:00:18] <jinxer-wm>	 (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:02:44] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[17:07:33] <jinxer-wm>	 (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:08:46] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.020 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:09:44] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.024 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[17:10:18] <jinxer-wm>	 (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:12:33] <jinxer-wm>	 (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:13:02] <icinga-wm>	 PROBLEM - SSH on furud.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:13:53] <wikibugs>	 (CR) BryanDavis: [C: +2] kubernetes: Fix default resource handling [software/tools-webservice] - https://gerrit.wikimedia.org/r/783663 (owner: Majavah)
[17:14:18] <icinga-wm>	 PROBLEM - SSH on wtp1035.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:14:47] <wikibugs>	 (Merged) jenkins-bot: kubernetes: Fix default resource handling [software/tools-webservice] - https://gerrit.wikimedia.org/r/783663 (owner: Majavah)
[17:17:10] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 4.809 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[17:17:50] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.025 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:18:30] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[17:19:20] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[17:32:17] <wikibugs>	 (PS1) BryanDavis: k8s: add perl5.32 type [software/tools-webservice] - https://gerrit.wikimedia.org/r/785374 (https://phabricator.wikimedia.org/T214343)
[17:33:12] <wikibugs>	 (CR) jerkins-bot: [V: -1] k8s: add perl5.32 type [software/tools-webservice] - https://gerrit.wikimedia.org/r/785374 (https://phabricator.wikimedia.org/T214343) (owner: BryanDavis)
[17:34:22] <wikibugs>	 (PS2) BryanDavis: k8s: add perl5.32 type [software/tools-webservice] - https://gerrit.wikimedia.org/r/785374 (https://phabricator.wikimedia.org/T214343)
[17:37:51] <wikibugs>	 (CR) BryanDavis: Perform rolling restarts on kubernetes (1 comment) [software/tools-webservice] - https://gerrit.wikimedia.org/r/721989 (https://phabricator.wikimedia.org/T290833) (owner: Lucas Werkmeister)
[17:39:01] <wikibugs>	 (Abandoned) BryanDavis: Man page for webservice [software/tools-webservice] - https://gerrit.wikimedia.org/r/437054 (https://phabricator.wikimedia.org/T95097) (owner: Nehajha)
[17:44:35] <wikibugs>	 (CR) BryanDavis: [C: +2] k8s: add perl5.32 type [software/tools-webservice] - https://gerrit.wikimedia.org/r/785374 (https://phabricator.wikimedia.org/T214343) (owner: BryanDavis)
[17:45:30] <wikibugs>	 (Merged) jenkins-bot: k8s: add perl5.32 type [software/tools-webservice] - https://gerrit.wikimedia.org/r/785374 (https://phabricator.wikimedia.org/T214343) (owner: BryanDavis)
[17:46:00] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:47:55] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[17:51:18] <wikibugs>	 (PS1) BryanDavis: d/changelog: Prepare for 0.82 release [software/tools-webservice] - https://gerrit.wikimedia.org/r/785375 (https://phabricator.wikimedia.org/T214343)
[17:58:07] <wikibugs>	 (CR) BryanDavis: [C: +2] d/changelog: Prepare for 0.82 release [software/tools-webservice] - https://gerrit.wikimedia.org/r/785375 (https://phabricator.wikimedia.org/T214343) (owner: BryanDavis)
[17:59:56] <wikibugs>	 (Merged) jenkins-bot: d/changelog: Prepare for 0.82 release [software/tools-webservice] - https://gerrit.wikimedia.org/r/785375 (https://phabricator.wikimedia.org/T214343) (owner: BryanDavis)
[18:02:10] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1338 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[18:02:38] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1338 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[18:27:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T306560)', diff saved to https://phabricator.wikimedia.org/P26311 and previous config saved to /var/cache/conftool/dbconfig/20220423-182701-ladsgroup.json
[18:27:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:27:08] <stashbot>	 T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560
[18:32:55] <jinxer-wm>	 (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[18:42:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P26312 and previous config saved to /var/cache/conftool/dbconfig/20220423-184206-ladsgroup.json
[18:42:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:57:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P26313 and previous config saved to /var/cache/conftool/dbconfig/20220423-185711-ladsgroup.json
[18:57:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:01:55] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[19:12:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T306560)', diff saved to https://phabricator.wikimedia.org/P26314 and previous config saved to /var/cache/conftool/dbconfig/20220423-191216-ladsgroup.json
[19:12:18] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[19:12:19] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[19:12:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:12:21] <stashbot>	 T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560
[19:12:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:12:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3314 (T306560)', diff saved to https://phabricator.wikimedia.org/P26315 and previous config saved to /var/cache/conftool/dbconfig/20220423-191224-ladsgroup.json
[19:12:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:12:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:18:22] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1308 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.024 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[19:19:12] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1308 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[19:59:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on kubestagemaster1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[20:17:08] <wikibugs>	 (PS2) Krinkle: Stop writing to $wmfSwiftConfig [mediawiki-config] - https://gerrit.wikimedia.org/r/781058 (https://phabricator.wikimedia.org/T45956) (owner: Zabe)
[20:18:08] <icinga-wm>	 RECOVERY - SSH on wtp1035.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:38:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T306560)', diff saved to https://phabricator.wikimedia.org/P26316 and previous config saved to /var/cache/conftool/dbconfig/20220423-203808-ladsgroup.json
[20:38:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:38:15] <stashbot>	 T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560
[20:47:10] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:53:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P26317 and previous config saved to /var/cache/conftool/dbconfig/20220423-205313-ladsgroup.json
[20:53:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:02:44] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[21:08:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P26318 and previous config saved to /var/cache/conftool/dbconfig/20220423-210819-ladsgroup.json
[21:08:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:12:34] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:21:10] <icinga-wm>	 PROBLEM - SSH on wtp1046.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:23:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T306560)', diff saved to https://phabricator.wikimedia.org/P26319 and previous config saved to /var/cache/conftool/dbconfig/20220423-212324-ladsgroup.json
[21:23:25] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1141.eqiad.wmnet with reason: Maintenance
[21:23:27] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1141.eqiad.wmnet with reason: Maintenance
[21:23:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:23:29] <stashbot>	 T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560
[21:23:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:23:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1141 (T306560)', diff saved to https://phabricator.wikimedia.org/P26320 and previous config saved to /var/cache/conftool/dbconfig/20220423-212332-ladsgroup.json
[21:23:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:23:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:46:00] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:47:55] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[21:50:46] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[21:51:36] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1446 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[22:22:18] <icinga-wm>	 RECOVERY - SSH on wtp1046.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:32:55] <jinxer-wm>	 (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[22:42:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T306560)', diff saved to https://phabricator.wikimedia.org/P26321 and previous config saved to /var/cache/conftool/dbconfig/20220423-224220-ladsgroup.json
[22:42:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:42:28] <stashbot>	 T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560
[22:43:16] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[22:57:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P26322 and previous config saved to /var/cache/conftool/dbconfig/20220423-225725-ladsgroup.json
[22:57:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:01:55] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[23:10:34] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1445 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[23:10:58] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1445 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.024 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[23:11:58] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:12:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P26323 and previous config saved to /var/cache/conftool/dbconfig/20220423-231230-ladsgroup.json
[23:12:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:20:52] <icinga-wm>	 RECOVERY - SSH on furud.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:27:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T306560)', diff saved to https://phabricator.wikimedia.org/P26324 and previous config saved to /var/cache/conftool/dbconfig/20220423-232735-ladsgroup.json
[23:27:37] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1121.eqiad.wmnet with reason: Maintenance
[23:27:39] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1121.eqiad.wmnet with reason: Maintenance
[23:27:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:27:40] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[23:27:41] <stashbot>	 T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560
[23:27:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:27:44] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[23:27:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:27:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:27:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1121 (T306560)', diff saved to https://phabricator.wikimedia.org/P26325 and previous config saved to /var/cache/conftool/dbconfig/20220423-232748-ladsgroup.json
[23:27:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:27:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:37:00] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:59:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on kubestagemaster1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown