[00:00:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23227 and previous config saved to /var/cache/conftool/dbconfig/20220327-000023-ladsgroup.json [00:00:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance [00:00:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:00:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance [00:00:29] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [00:00:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [00:00:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:00:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:00:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [00:00:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:00:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:00:50] RECOVERY - SSH on ml-serve-ctrl1002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [00:01:58] (KubernetesRsyslogDown) firing: rsyslog on ml-serve1002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [00:02:58] (KubernetesCalicoDown) resolved: 
ml-serve-ctrl1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [00:03:45] (JobUnavailable) firing: (2) Reduced availability for job k8s-api in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:06:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [00:06:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [00:06:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:06:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:06:58] (KubernetesRsyslogDown) resolved: rsyslog on ml-serve1002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [00:28:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [00:28:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [00:28:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:28:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:49:57] (CR) DannyS712: "This change is ready for review."
[mediawiki-config] - https://gerrit.wikimedia.org/r/773863 (https://phabricator.wikimedia.org/T171115) (owner: DannyS712) [00:49:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [00:50:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [00:50:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [00:50:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:50:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [00:50:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:50:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:50:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T298565)', diff saved to https://phabricator.wikimedia.org/P23228 and previous config saved to /var/cache/conftool/dbconfig/20220327-005010-ladsgroup.json [00:50:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:50:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:50:17] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [00:54:32] (CR) DannyS712: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/773864 (https://phabricator.wikimedia.org/T171115) (owner: DannyS712) [01:03:24] (CR) DannyS712: "This change is ready for review."
[mediawiki-config] - https://gerrit.wikimedia.org/r/773865 (https://phabricator.wikimedia.org/T171115) (owner: DannyS712) [01:12:34] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 77 probes of 674 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [01:13:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T298565)', diff saved to https://phabricator.wikimedia.org/P23229 and previous config saved to /var/cache/conftool/dbconfig/20220327-011324-ladsgroup.json [01:13:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:13:30] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [01:17:22] (CR) DannyS712: "This change is ready for review."
[mediawiki-config] - https://gerrit.wikimedia.org/r/773966 (https://phabricator.wikimedia.org/T171115) (owner: DannyS712) [01:18:52] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 60 probes of 674 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [01:28:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P23230 and previous config saved to /var/cache/conftool/dbconfig/20220327-012829-ladsgroup.json [01:28:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:35:02] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 30.71 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [01:38:45] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:43:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P23231 and previous config saved to /var/cache/conftool/dbconfig/20220327-014335-ladsgroup.json [01:43:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:43:45] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:46:14] RECOVERY - Varnish traffic drop between 30min ago and now at esams on
alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [01:58:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T298565)', diff saved to https://phabricator.wikimedia.org/P23232 and previous config saved to /var/cache/conftool/dbconfig/20220327-015840-ladsgroup.json [01:58:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [01:58:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [01:58:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:58:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1162 (T298565)', diff saved to https://phabricator.wikimedia.org/P23233 and previous config saved to /var/cache/conftool/dbconfig/20220327-015848-ladsgroup.json [01:58:49] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [01:58:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:58:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:58:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [02:05:20] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL 
- failed 78 probes of 674 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [02:11:32] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 60 probes of 674 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [02:22:20] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 82 probes of 674 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [02:25:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T298565)', diff saved to https://phabricator.wikimedia.org/P23234 and previous config saved to /var/cache/conftool/dbconfig/20220327-022552-ladsgroup.json [02:25:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:25:58] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [02:40:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P23235 and previous config saved to /var/cache/conftool/dbconfig/20220327-024057-ladsgroup.json [02:41:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:56:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P23236 and previous config saved to 
/var/cache/conftool/dbconfig/20220327-025603-ladsgroup.json [02:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:11:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T298565)', diff saved to https://phabricator.wikimedia.org/P23237 and previous config saved to /var/cache/conftool/dbconfig/20220327-031108-ladsgroup.json [03:11:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [03:11:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [03:11:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:11:14] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [03:11:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23238 and previous config saved to /var/cache/conftool/dbconfig/20220327-031115-ladsgroup.json [03:11:16] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 59 probes of 674 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [03:11:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:11:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:11:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:22:02] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 70 probes of 674 (alerts on 65) - 
https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [03:35:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23239 and previous config saved to /var/cache/conftool/dbconfig/20220327-033526-ladsgroup.json [03:35:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:35:32] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [03:50:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P23240 and previous config saved to /var/cache/conftool/dbconfig/20220327-035031-ladsgroup.json [03:50:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:01:00] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:05:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P23241 and previous config saved to /var/cache/conftool/dbconfig/20220327-040536-ladsgroup.json [04:05:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:06:32] PROBLEM - Ensure traffic_exporter for the tls instance binds on port 9322 and responds to HTTP requests on cp3050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [04:09:12] PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The 
following units failed: debian-weekly-rebuild.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:14:32] RECOVERY - Ensure traffic_exporter for the tls instance binds on port 9322 and responds to HTTP requests on cp3050 is OK: HTTP OK: HTTP/1.0 200 OK - 23696 bytes in 0.248 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [04:17:22] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 57 probes of 674 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [04:20:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23242 and previous config saved to /var/cache/conftool/dbconfig/20220327-042041-ladsgroup.json [04:20:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [04:20:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [04:20:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:20:48] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [04:20:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:20:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:33:36] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 90 probes of 674 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [04:39:50] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 58 probes of 674 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [04:42:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [04:42:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [04:42:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:42:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T298565)', diff saved to https://phabricator.wikimedia.org/P23243 and previous config saved to /var/cache/conftool/dbconfig/20220327-044235-ladsgroup.json [04:42:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:42:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:42:42] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [05:05:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T298565)', diff saved to https://phabricator.wikimedia.org/P23244 and previous config saved to /var/cache/conftool/dbconfig/20220327-050545-ladsgroup.json [05:05:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:05:51] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, 
user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [05:20:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P23245 and previous config saved to /var/cache/conftool/dbconfig/20220327-052050-ladsgroup.json [05:20:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:27:08] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:35:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P23246 and previous config saved to /var/cache/conftool/dbconfig/20220327-053555-ladsgroup.json [05:35:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:38:36] PROBLEM - SSH on ml-serve-ctrl1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:38:45] (JobUnavailable) firing: (2) Reduced availability for job k8s-api in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:38:58] (KubernetesCalicoDown) firing: ml-serve-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [05:42:58] RECOVERY - SSH on ml-serve-ctrl1001 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:43:45] (JobUnavailable) firing: (2) Reduced availability for job k8s-api in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:48:58] (KubernetesCalicoDown) resolved: ml-serve-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [05:51:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T298565)', diff saved to https://phabricator.wikimedia.org/P23247 and previous config saved to /var/cache/conftool/dbconfig/20220327-055100-ladsgroup.json [05:51:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [05:51:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [05:51:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:51:06] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [05:51:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:51:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23248 and previous config saved to /var/cache/conftool/dbconfig/20220327-055108-ladsgroup.json [05:51:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:51:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - 
https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:26:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23249 and previous config saved to /var/cache/conftool/dbconfig/20220327-062641-ladsgroup.json [06:26:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:26:49] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [06:41:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P23250 and previous config saved to /var/cache/conftool/dbconfig/20220327-064146-ladsgroup.json [06:41:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:56:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P23251 and previous config saved to /var/cache/conftool/dbconfig/20220327-065651-ladsgroup.json [06:56:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:59:48] PROBLEM - SSH on wtp1026.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220327T0700) [07:11:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23252 and previous config saved to /var/cache/conftool/dbconfig/20220327-071156-ladsgroup.json [07:11:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [07:11:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [07:12:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:02] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [07:12:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P23253 and previous config saved to /var/cache/conftool/dbconfig/20220327-071203-ladsgroup.json [07:12:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P23254 and previous config saved to /var/cache/conftool/dbconfig/20220327-081218-ladsgroup.json [08:12:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:25] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, 
user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [08:18:54] PROBLEM - SSH on ml-serve-ctrl1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:19:58] (KubernetesCalicoDown) firing: ml-serve-ctrl1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [08:23:45] (JobUnavailable) firing: (2) Reduced availability for job k8s-api in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:23:58] (KubernetesRsyslogDown) firing: rsyslog on ml-serve1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:27:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P23255 and previous config saved to /var/cache/conftool/dbconfig/20220327-082723-ladsgroup.json [08:27:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:44] RECOVERY - SSH on ml-serve-ctrl1002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:28:45] (JobUnavailable) firing: (2) Reduced availability for job k8s-api in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:28:58] (KubernetesRsyslogDown) resolved: rsyslog on ml-serve1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues 
- https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:29:58] (KubernetesCalicoDown) resolved: ml-serve-ctrl1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [08:42:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P23256 and previous config saved to /var/cache/conftool/dbconfig/20220327-084228-ladsgroup.json [08:42:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:50] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:57:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P23257 and previous config saved to /var/cache/conftool/dbconfig/20220327-085733-ladsgroup.json [08:57:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [08:57:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [08:57:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:38] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [08:57:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23258 and previous config saved to 
/var/cache/conftool/dbconfig/20220327-085741-ladsgroup.json [08:57:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:58] RECOVERY - SSH on wtp1026.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:24:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23259 and previous config saved to /var/cache/conftool/dbconfig/20220327-092459-ladsgroup.json [09:25:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:05] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [09:40:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P23260 and previous config saved to /var/cache/conftool/dbconfig/20220327-094004-ladsgroup.json [09:40:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P23261 and previous config saved to /var/cache/conftool/dbconfig/20220327-095509-ladsgroup.json [09:55:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - 
https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:10:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23262 and previous config saved to /var/cache/conftool/dbconfig/20220327-101014-ladsgroup.json [10:10:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [10:10:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [10:10:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:20] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [10:10:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T298565)', diff saved to https://phabricator.wikimedia.org/P23263 and previous config saved to /var/cache/conftool/dbconfig/20220327-101022-ladsgroup.json [10:10:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T298565)', diff saved to https://phabricator.wikimedia.org/P23264 and previous config saved to /var/cache/conftool/dbconfig/20220327-103447-ladsgroup.json [10:34:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:54] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, 
user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [10:35:10] PROBLEM - SSH on aqs1007.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:41:08] (03PS1) 10Majavah: Add developer-portal chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/773994 (https://phabricator.wikimedia.org/T297140) [10:41:10] (03PS1) 10Majavah: helmfile.d: add developer-portal [deployment-charts] - 10https://gerrit.wikimedia.org/r/773995 (https://phabricator.wikimedia.org/T297140) [10:42:17] 10SRE-swift-storage: File not found: /v1/AUTH_mw/wikipedia-commons-local-public.9e/9/9e/Christopher_Wilbrand.jpg - https://phabricator.wikimedia.org/T304788 (10Wurgl) [10:45:44] PROBLEM - SSH on ml-serve-ctrl1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:49:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P23265 and previous config saved to /var/cache/conftool/dbconfig/20220327-104952-ladsgroup.json [10:49:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:58] (KubernetesCalicoDown) firing: ml-serve-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:54:12] RECOVERY - SSH on ml-serve-ctrl1001 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:57:58] (KubernetesCalicoDown) resolved: ml-serve-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [11:04:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P23266 and previous config saved to /var/cache/conftool/dbconfig/20220327-110457-ladsgroup.json [11:05:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T298565)', diff saved to https://phabricator.wikimedia.org/P23267 and previous config saved to /var/cache/conftool/dbconfig/20220327-112003-ladsgroup.json [11:20:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [11:20:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [11:20:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:09] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [11:20:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:28] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:36:14] RECOVERY - SSH on aqs1007.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:41:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [11:41:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: 
Maintenance [11:41:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23268 and previous config saved to /var/cache/conftool/dbconfig/20220327-114152-ladsgroup.json [11:41:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:58] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [12:06:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23269 and previous config saved to /var/cache/conftool/dbconfig/20220327-120604-ladsgroup.json [12:06:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:12] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [12:10:18] PROBLEM - WDQS SPARQL on wdqs2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [12:12:24] RECOVERY - WDQS SPARQL on wdqs2004 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.201 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [12:21:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P23270 and previous config 
saved to /var/cache/conftool/dbconfig/20220327-122110-ladsgroup.json [12:21:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:00] (JobUnavailable) firing: Reduced availability for job trafficserver in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:36:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P23271 and previous config saved to /var/cache/conftool/dbconfig/20220327-123615-ladsgroup.json [12:36:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23272 and previous config saved to /var/cache/conftool/dbconfig/20220327-125120-ladsgroup.json [12:51:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [12:51:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [12:51:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:25] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [12:51:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1162 (T298565)', diff saved to https://phabricator.wikimedia.org/P23273 and previous config saved to /var/cache/conftool/dbconfig/20220327-125128-ladsgroup.json [12:51:28] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:20] PROBLEM - WDQS SPARQL on wdqs2007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 429 Too Many Requests - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 754 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [12:58:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T298565)', diff saved to https://phabricator.wikimedia.org/P23274 and previous config saved to /var/cache/conftool/dbconfig/20220327-125842-ladsgroup.json [12:58:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:50] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [13:01:02] PROBLEM - SSH on ml-serve-ctrl1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:01:14] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:01:44] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-ssl_443: Servers wdqs2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:02:58] (KubernetesCalicoDown) firing: ml-serve-ctrl1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:03:32] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2004.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2004.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2004.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:03:45] (JobUnavailable) firing: (2) Reduced availability for job k8s-api in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:04:00] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:05:48] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:08:03] 10SRE-swift-storage: File not found: /v1/AUTH_mw/wikipedia-commons-local-public.9e/9/9e/Christopher_Wilbrand.jpg - https://phabricator.wikimedia.org/T304788 (10De728631) This could be fixed by deleting and restoring the file, so it was still present on the server. It is also a recurring error (cf. [[ https://com... 
[13:13:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P23275 and previous config saved to /var/cache/conftool/dbconfig/20220327-131347-ladsgroup.json [13:13:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:32] RECOVERY - SSH on ml-serve-ctrl1002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:16:55] (03PS4) 10Jforrester: TimedMediaHandler: Make videojs the only player on all group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612348 (https://phabricator.wikimedia.org/T248418) [13:16:57] (03PS4) 10Jforrester: TimedMediaHandler: Make videojs the only player everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612349 (https://phabricator.wikimedia.org/T248418) [13:16:58] (KubernetesRsyslogDown) firing: rsyslog on ml-serve1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [13:18:48] 10SRE-swift-storage: File not found: /v1/AUTH_mw/wikipedia-commons-local-public.9e/9/9e/Christopher_Wilbrand.jpg - https://phabricator.wikimedia.org/T304788 (10De728631) See also T238695 where undeletion apparently did not solve the problem. While the media player is present in the preview, the underlying sound... [13:18:51] 10SRE-swift-storage: File not found: /v1/AUTH_mw/wikipedia-commons-local-public.9e/9/9e/Christopher_Wilbrand.jpg - https://phabricator.wikimedia.org/T304788 (10Wurgl) Thx! 
[13:27:56] PROBLEM - SSH on ml-serve-ctrl1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:28:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P23276 and previous config saved to /var/cache/conftool/dbconfig/20220327-132852-ladsgroup.json [13:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:58] (KubernetesRsyslogDown) resolved: rsyslog on ml-serve1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [13:34:30] RECOVERY - SSH on ml-serve-ctrl1002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:37:58] (KubernetesCalicoDown) resolved: ml-serve-ctrl1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:38:24] 10SRE-swift-storage: File not found: /v1/AUTH_mw/wikipedia-commons-local-public.9e/9/9e/Christopher_Wilbrand.jpg - https://phabricator.wikimedia.org/T304788 (10De728631) Apparently the upload process is also affected by this bug. While patrolling new files at Commons I just found https://commons.wikimedia.org/wi... 
[13:38:45] (JobUnavailable) firing: (2) Reduced availability for job k8s-api in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:43:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T298565)', diff saved to https://phabricator.wikimedia.org/P23277 and previous config saved to /var/cache/conftool/dbconfig/20220327-134358-ladsgroup.json [13:44:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [13:44:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [13:44:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [13:44:03] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [13:44:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [13:44:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T298565)', diff saved to https://phabricator.wikimedia.org/P23278 and previous config saved to /var/cache/conftool/dbconfig/20220327-134411-ladsgroup.json [13:44:12] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:36] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:57:26] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:00:04] PROBLEM - WDQS SPARQL on wdqs2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [14:00:34] PROBLEM - WDQS SPARQL on wdqs2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [14:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:03:38] PROBLEM - WDQS SPARQL on wdqs2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [14:05:03] PROBLEM - LVS wdqs-ssl codfw port 443/tcp - Wikidata Query Service - HTTPS IPv4 #page on wdqs.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:06:48] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_wikidata-updateQueryServiceLag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:08:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T298565)', diff saved to https://phabricator.wikimedia.org/P23279 and previous config saved to /var/cache/conftool/dbconfig/20220327-140825-ladsgroup.json [14:08:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:31] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [14:09:02] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:10:10] (03PS1) 10Tpt: Removes the ProofreadPageUseStatusChangeTags option [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774004 (https://phabricator.wikimedia.org/T304795) [14:15:44] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_wikidata-updateQueryServiceLag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:59] hello folks, anybody checking wdqs? 
[14:17:11] Cc: gehel, dcausse, ryankemper [14:18:17] !log restart blazegraph on wdqs2003 [14:18:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:22] PROBLEM - Query Service HTTP Port on wdqs2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 7.809 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [14:20:43] RECOVERY - LVS wdqs-ssl codfw port 443/tcp - Wikidata Query Service - HTTPS IPv4 #page on wdqs.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 483 bytes in 1.237 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:20:49] !log roll restart of wdqs-blazegraph-public codfw [14:20:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:08] PROBLEM - Query Service HTTP Port on wdqs2007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 6.823 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [14:22:54] RECOVERY - WDQS SPARQL on wdqs2002 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.381 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [14:22:56] RECOVERY - WDQS SPARQL on wdqs2004 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.456 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [14:23:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P23280 and previous config saved to /var/cache/conftool/dbconfig/20220327-142330-ladsgroup.json [14:23:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:48] RECOVERY - Query Service HTTP Port on wdqs2002 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.202 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [14:24:06] RECOVERY - Query Service HTTP Port on wdqs2007 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 
0.027 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [14:24:50] I am checking metrics in https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&refresh=1m&from=now-3h&to=now and they look reasonably ok [14:25:31] the thread count is a little weird [14:25:46] RECOVERY - WDQS SPARQL on wdqs2003 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 2.444 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [14:25:50] RECOVERY - WDQS SPARQL on wdqs2007 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.238 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [14:26:18] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:26:36] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:26:38] RECOVERY - Query Service HTTP Port on wdqs2001 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.196 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [14:26:38] RECOVERY - WDQS SPARQL on wdqs2001 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.335 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [14:26:42] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:34:57] PROBLEM - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is CRITICAL: 0.2682 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [14:35:48] seriously? [14:35:51] On phone [14:35:59] Should I get back home? 
[14:36:00] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.7742 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [14:36:21] Amir1: I'm home if we are needed, no worries [14:36:36] Thanks [14:36:45] there seem to be a lot more POSTs [14:36:54] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=now-3h&orgId=1&to=now&var-cluster=api_appserver&var-datasource=eqiad%20prometheus%2Fops [14:37:32] elukey: _security? [14:38:08] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:38:17] Amir1: still trying to check [14:38:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P23281 and previous config saved to /var/cache/conftool/dbconfig/20220327-143835-ladsgroup.json [14:38:37] Thanks [14:38:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:40] * akosiaris around [14:53:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T298565)', diff saved to https://phabricator.wikimedia.org/P23282 and previous config saved to /var/cache/conftool/dbconfig/20220327-145341-ladsgroup.json [14:53:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [14:53:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [14:53:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:48] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, 
user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [14:53:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:15] RECOVERY - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is OK: (C)0.3 lt (W)0.5 lt 0.5529 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [14:58:34] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.04839 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [15:06:31] PROBLEM - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is CRITICAL: 0.2677 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [15:09:50] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.6613 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [15:15:33] RECOVERY - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is OK: (C)0.3 lt (W)0.5 lt 0.5115 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [15:15:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: 
Maintenance [15:15:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [15:15:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:10] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.01613 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [15:33:20] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.5161 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [15:37:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance [15:37:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance [15:37:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [15:37:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [15:38:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:24] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.03226 https://bit.ly/wmf-fpmsat 
https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [15:43:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [15:43:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [15:43:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23283 and previous config saved to /var/cache/conftool/dbconfig/20220327-154357-ladsgroup.json [15:44:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:02] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [15:58:32] PROBLEM - MariaDB Replica Lag: s2 on db2148 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 881.56 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:58:40] PROBLEM - MariaDB Replica Lag: s2 on db2107 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 890.20 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:59:02] PROBLEM - MariaDB Replica Lag: s2 on db2101 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1254.51 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:00:36] RECOVERY - MariaDB Replica Lag: s2 on db2148 is OK: OK slave_sql_lag Replication lag: 0.15 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:00:46] 
RECOVERY - MariaDB Replica Lag: s2 on db2107 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:01:06] RECOVERY - MariaDB Replica Lag: s2 on db2101 is OK: OK slave_sql_lag Replication lag: 0.34 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:10:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23284 and previous config saved to /var/cache/conftool/dbconfig/20220327-161006-ladsgroup.json [16:10:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:12] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [16:25:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P23285 and previous config saved to /var/cache/conftool/dbconfig/20220327-162511-ladsgroup.json [16:25:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P23286 and previous config saved to /var/cache/conftool/dbconfig/20220327-164017-ladsgroup.json [16:40:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23287 and previous config saved to /var/cache/conftool/dbconfig/20220327-165522-ladsgroup.json [16:55:23] !log ladsgroup@cumin1001 START - Cookbook 
sre.hosts.downtime for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [16:55:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [16:55:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:28] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [16:55:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P23288 and previous config saved to /var/cache/conftool/dbconfig/20220327-165530-ladsgroup.json [16:55:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:04] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:39:00] (JobUnavailable) firing: Reduced availability for job trafficserver in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:55:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P23289 and previous config saved to /var/cache/conftool/dbconfig/20220327-175544-ladsgroup.json [17:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:51] T298565: Fix mismatching field type of user table for columns 
user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [18:06:40] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:10:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P23290 and previous config saved to /var/cache/conftool/dbconfig/20220327-181049-ladsgroup.json [18:10:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P23291 and previous config saved to /var/cache/conftool/dbconfig/20220327-182554-ladsgroup.json [18:25:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P23292 and previous config saved to /var/cache/conftool/dbconfig/20220327-184059-ladsgroup.json [18:41:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [18:41:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [18:41:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:06] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, 
user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [18:41:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23293 and previous config saved to /var/cache/conftool/dbconfig/20220327-184107-ladsgroup.json [18:41:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23294 and previous config saved to /var/cache/conftool/dbconfig/20220327-190742-ladsgroup.json [19:07:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:50] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [19:08:30] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:08:40] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.5645 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [19:09:55] PROBLEM - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is CRITICAL: 0.2933 lt 0.3 https://bit.ly/wmf-fpmsat 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [19:11:42] 👋 looking [19:22:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P23295 and previous config saved to /var/cache/conftool/dbconfig/20220327-192247-ladsgroup.json [19:22:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:52] <_joe_> !log restarting php on mw1380 [19:25:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:13] SRE, ops-eqiad: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (Papaul) [19:35:51] <_joe_> !log $ sudo cumin -b1 -s20 'A:mw-api and P{mw13[56-82].eqiad.wmnet}' 'restart-php7.2-fpm' [19:35:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P23296 and previous config saved to /var/cache/conftool/dbconfig/20220327-193753-ladsgroup.json [19:37:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:56] SRE, ops-eqiad: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (Papaul) [19:47:40] PROBLEM - SSH on ml-serve-ctrl1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:47:58] (KubernetesCalicoDown) firing: ml-serve-ctrl1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [19:48:45] (JobUnavailable) firing: (2) Reduced availability for job k8s-api in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable 
- https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:50:58] (KubernetesRsyslogDown) firing: rsyslog on ml-serve1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [19:52:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23297 and previous config saved to /var/cache/conftool/dbconfig/20220327-195258-ladsgroup.json [19:53:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance [19:53:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:03] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [19:53:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance [19:53:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [19:53:05] SRE, ops-eqiad: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (Papaul) [19:53:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [19:53:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:16] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [19:58:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [19:58:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:15] RECOVERY - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is OK: (C)0.3 lt (W)0.5 lt 0.5299 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [19:59:38] SRE, ops-eqiad: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (Papaul) [20:04:45] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.06452 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [20:07:18] SRE, ops-eqiad: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (Papaul) [20:07:30] (PS1) Zabe: Migrate $wmfServiceConfig to $wmgServiceConfig [mediawiki-config] - https://gerrit.wikimedia.org/r/774019 (https://phabricator.wikimedia.org/T45956) [20:08:19] SRE, ops-eqiad: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (Papaul) [20:11:06] SRE, ops-eqiad: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (Papaul) @ayounsi I did all the switches up-link to both core routers, please double check and see if all looks good. 
Thanks [20:11:17] SRE, ChangeProp, envoy, serviceops: Investigate shorter-lived persistent connections for Envoy - https://phabricator.wikimedia.org/T304799 (RLazarus) [20:12:38] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [20:13:30] SRE, serviceops: Set API server weights - https://phabricator.wikimedia.org/T304800 (RLazarus) [20:14:59] PROBLEM - LVS zotero codfw port 4969/tcp - Zotero- zotero.svc.codfw.wmnet IPv4 #page on zotero.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:15:30] SRE, SRE-swift-storage: File not found: /v1/AUTH_mw/wikipedia-commons-local-public.9e/9/9e/Christopher_Wilbrand.jpg - https://phabricator.wikimedia.org/T304788 (Peachey88) [20:15:58] (KubernetesRsyslogDown) resolved: rsyslog on ml-serve1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:17:07] RECOVERY - LVS zotero codfw port 4969/tcp - Zotero- zotero.svc.codfw.wmnet IPv4 #page on zotero.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 197 bytes in 1.153 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:20:00] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/zotero: sync [20:20:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:16] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/zotero: sync [20:20:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:58] (KubernetesRsyslogDown) firing: rsyslog on ml-serve1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:21:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [20:21:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [20:21:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:36] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [20:23:32] RECOVERY - SSH on ml-serve-ctrl1002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:23:45] (JobUnavailable) firing: (2) Reduced availability for job k8s-api in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:27:58] (KubernetesCalicoDown) resolved: ml-serve-ctrl1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [20:45:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [20:45:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [20:45:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [20:45:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 
clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [20:46:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T298565)', diff saved to https://phabricator.wikimedia.org/P23298 and previous config saved to /var/cache/conftool/dbconfig/20220327-204604-ladsgroup.json [20:46:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:11] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [21:09:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T298565)', diff saved to https://phabricator.wikimedia.org/P23299 and previous config saved to /var/cache/conftool/dbconfig/20220327-210917-ladsgroup.json [21:09:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:23] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [21:13:21] PROBLEM - LVS zotero codfw port 4969/tcp - Zotero- zotero.svc.codfw.wmnet IPv4 #page on zotero.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [21:15:27] RECOVERY - LVS zotero codfw port 4969/tcp - Zotero- zotero.svc.codfw.wmnet IPv4 #page on zotero.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 
200 OK - 196 bytes in 1.144 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [21:24:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P23300 and previous config saved to /var/cache/conftool/dbconfig/20220327-212422-ladsgroup.json [21:24:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:22] SRE, ChangeProp, envoy, serviceops, Sustainability (Incident Followup): Investigate shorter-lived persistent connections for Envoy - https://phabricator.wikimedia.org/T304799 (RhinosF1) [21:25:44] SRE, serviceops, Sustainability (Incident Followup): Set API server weights - https://phabricator.wikimedia.org/T304800 (RhinosF1) [21:39:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P23301 and previous config saved to /var/cache/conftool/dbconfig/20220327-213927-ladsgroup.json [21:39:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:06] (PS1) Stang: Throttle: Add rule for Bard College class project on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/774023 (https://phabricator.wikimedia.org/T304687) [21:54:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T298565)', diff saved to https://phabricator.wikimedia.org/P23302 and previous config saved to /var/cache/conftool/dbconfig/20220327-215432-ladsgroup.json [21:54:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [21:54:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [21:54:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:40] T298565: Fix mismatching field type of user table for 
columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [21:54:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1162 (T298565)', diff saved to https://phabricator.wikimedia.org/P23303 and previous config saved to /var/cache/conftool/dbconfig/20220327-215440-ladsgroup.json [21:54:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T298565)', diff saved to https://phabricator.wikimedia.org/P23304 and previous config saved to /var/cache/conftool/dbconfig/20220327-220143-ladsgroup.json [22:01:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:51] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [22:06:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:12:14] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:16:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to 
https://phabricator.wikimedia.org/P23305 and previous config saved to /var/cache/conftool/dbconfig/20220327-221649-ladsgroup.json [22:16:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:20:55] =hiya tim! [22:20:58] o: [22:31:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P23306 and previous config saved to /var/cache/conftool/dbconfig/20220327-223154-ladsgroup.json [22:31:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T298565)', diff saved to https://phabricator.wikimedia.org/P23307 and previous config saved to /var/cache/conftool/dbconfig/20220327-224659-ladsgroup.json [22:47:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [22:47:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [22:47:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:05] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [22:47:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23308 and previous config saved to /var/cache/conftool/dbconfig/20220327-224707-ladsgroup.json [22:47:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:15] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:52:18] PROBLEM - restbase endpoints health on restbase2026 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:54:26] RECOVERY - restbase endpoints health on restbase2026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [23:09:22] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [23:10:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23309 and previous config saved to /var/cache/conftool/dbconfig/20220327-231001-ladsgroup.json [23:10:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:07] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [23:11:38] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [23:25:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P23310 and previous config saved to /var/cache/conftool/dbconfig/20220327-232506-ladsgroup.json [23:25:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:40:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P23311 and previous config saved to /var/cache/conftool/dbconfig/20220327-234011-ladsgroup.json [23:40:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:55:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23312 and previous config saved to /var/cache/conftool/dbconfig/20220327-235516-ladsgroup.json [23:55:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [23:55:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [23:55:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:55:23] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [23:55:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:55:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log