[00:01:18] PROBLEM - MariaDB Replica IO: x1 on db2101 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2096.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:03:36] RECOVERY - MariaDB Replica IO: x1 on db2101 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:05:16] SRE, LDAP-Access-Requests: Grant Access to nda for jmads - https://phabricator.wikimedia.org/T306117 (jmads) We can use June 30th for now. That's when the current contract is set to expire. [00:05:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P24679 and previous config saved to /var/cache/conftool/dbconfig/20220416-000528-ladsgroup.json [00:05:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:15:06] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:19:25] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be1071.eqiad.wmnet with OS stretch [00:19:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:19:31] SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host ms-be1071.eqiad.wmnet with OS stretch executed with errors: -... 
[00:20:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P24680 and previous config saved to /var/cache/conftool/dbconfig/20220416-002033-ladsgroup.json [00:20:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:32:13] (KubernetesRsyslogDown) firing: rsyslog on kubernetes1018:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [00:35:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P24681 and previous config saved to /var/cache/conftool/dbconfig/20220416-003538-ladsgroup.json [00:35:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:35:43] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [01:38:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:48:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:32:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - 
https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [03:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [03:05:10] PROBLEM - exim queue #page on mx1001 is CRITICAL: CRITICAL: 4013 mails in exim queue. https://wikitech.wikimedia.org/wiki/Exim https://grafana.wikimedia.org/d/000000451/mail [03:21:46] PROBLEM - WDQS SPARQL on wdqs1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:26:10] RECOVERY - WDQS SPARQL on wdqs1005 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.149 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [04:32:13] (KubernetesRsyslogDown) firing: rsyslog on kubernetes1018:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [04:46:52] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:48:04] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:19:44] (PS1) Majavah: openldap: ldap is no longer authoritative for keystone projects [puppet] - https://gerrit.wikimedia.org/r/783190 [06:20:19] (CR) jerkins-bot: [V: -1] openldap: ldap is no longer authoritative for keystone projects [puppet] - https://gerrit.wikimedia.org/r/783190 (owner: Majavah) [06:21:41] (PS2) Majavah: openldap: ldap is no longer authoritative for keystone projects [puppet] - 
https://gerrit.wikimedia.org/r/783190 [06:32:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220416T0700) [07:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [07:14:57] PROBLEM - exim queue #page on mx1001 is CRITICAL: CRITICAL: 4290 mails in exim queue. https://wikitech.wikimedia.org/wiki/Exim https://grafana.wikimedia.org/d/000000451/mail [08:08:56] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 32.15 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [08:09:10] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 11.02 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [08:10:18] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 15.51 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [08:14:48] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: (C)60 le (W)70 le 83.83 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts 
https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [08:15:40] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: (C)60 le (W)70 le 101.8 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [08:15:56] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [08:36:58] (KubernetesRsyslogDown) firing: rsyslog on kubernetes1018:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:40:48] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:02:13] (PS2) Majavah: kubeadm: label nodes with nfs mounts [puppet] - https://gerrit.wikimedia.org/r/773933 (https://phabricator.wikimedia.org/T304708) [09:04:29] (PS3) Majavah: kubeadm: label nodes with nfs mounts [puppet] - https://gerrit.wikimedia.org/r/773933 (https://phabricator.wikimedia.org/T304708) [09:05:11] (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34866/console" [puppet] - https://gerrit.wikimedia.org/r/773933 (https://phabricator.wikimedia.org/T304708) (owner: Majavah) [09:08:45] (PS4) Majavah: kubeadm: label nodes with nfs mounts [puppet] - https://gerrit.wikimedia.org/r/773933 (https://phabricator.wikimedia.org/T304708) [09:09:29] (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34867/console" [puppet] - 
https://gerrit.wikimedia.org/r/773933 (https://phabricator.wikimedia.org/T304708) (owner: Majavah) [09:41:56] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:15:48] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:32:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [11:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [11:17:00] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:24:53] PROBLEM - exim queue #page on mx1001 is CRITICAL: CRITICAL: 4490 mails in exim queue. 
https://wikitech.wikimedia.org/wiki/Exim https://grafana.wikimedia.org/d/000000451/mail [11:34:22] around [11:34:27] let me check [11:50:54] (PS11) Vivian Rook: pcc commit do not merge [puppet] - https://gerrit.wikimedia.org/r/782107 [11:51:27] (CR) jerkins-bot: [V: -1] pcc commit do not merge [puppet] - https://gerrit.wikimedia.org/r/782107 (owner: Vivian Rook) [11:54:22] (PS12) Vivian Rook: pcc commit do not merge [puppet] - https://gerrit.wikimedia.org/r/782107 [11:54:54] (CR) jerkins-bot: [V: -1] pcc commit do not merge [puppet] - https://gerrit.wikimedia.org/r/782107 (owner: Vivian Rook) [11:55:36] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:37:13] (KubernetesRsyslogDown) firing: rsyslog on kubernetes1018:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [13:57:56] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:04:38] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:32:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:24:53] (PS1) 
Zabe: role::mediawiki::maintenance: remove reference to cron [puppet] - https://gerrit.wikimedia.org/r/783410 [15:25:26] (CR) jerkins-bot: [V: -1] role::mediawiki::maintenance: remove reference to cron [puppet] - https://gerrit.wikimedia.org/r/783410 (owner: Zabe) [15:26:24] (PS2) Zabe: role::mediawiki::maintenance: remove reference to cron [puppet] - https://gerrit.wikimedia.org/r/783410 [15:29:42] (CR) Zabe: [V: +1] "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1001/34870/" [puppet] - https://gerrit.wikimedia.org/r/783410 (owner: Zabe) [15:34:49] PROBLEM - exim queue #page on mx1001 is CRITICAL: CRITICAL: 4092 mails in exim queue. https://wikitech.wikimedia.org/wiki/Exim https://grafana.wikimedia.org/d/000000451/mail [15:36:30] (CR) Zabe: [V: +1] "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1001/34871/" [puppet] - https://gerrit.wikimedia.org/r/776349 (https://phabricator.wikimedia.org/T257473) (owner: Zabe) [16:37:13] (KubernetesRsyslogDown) firing: rsyslog on kubernetes1018:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [17:10:40] !log drop deferred email to tools.libraryupgrader on mx1001 [17:10:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:57] RECOVERY - exim queue #page on mx1001 is OK: OK: Less than 2000 mails in exim queue. 
https://wikitech.wikimedia.org/wiki/Exim https://grafana.wikimedia.org/d/000000451/mail [17:47:39] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [17:49:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [17:49:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [17:49:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 6 hosts with reason: Maintenance [17:49:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 6 hosts with reason: Maintenance [17:49:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [17:50:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [17:50:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [17:55:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: 
Maintenance [17:55:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1138.eqiad.wmnet with reason: Maintenance [18:00:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1138.eqiad.wmnet with reason: Maintenance [18:00:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1138 (T298565)', diff saved to https://phabricator.wikimedia.org/P24682 and previous config saved to /var/cache/conftool/dbconfig/20220416-180027-ladsgroup.json [18:00:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:31] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [18:05:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance [18:05:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance [18:05:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [18:05:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [18:05:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [18:16:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [18:16:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:10] (Abandoned) Majavah: kubeadm: Update kube-state-metrics to 2.2.4 [puppet] - https://gerrit.wikimedia.org/r/740323 (https://phabricator.wikimedia.org/T295190) (owner: Majavah) [18:24:20] (PS1) Stang: Wikispecies: update logo to prevent being obscured [mediawiki-config] - https://gerrit.wikimedia.org/r/783417 (https://phabricator.wikimedia.org/T306037) [18:25:42] (PS1) Majavah: kubeadm: remove metrics files [puppet] - https://gerrit.wikimedia.org/r/783418 [18:26:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance [18:26:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance [18:26:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1131 (T298565)', diff saved to https://phabricator.wikimedia.org/P24683 and previous config saved to /var/cache/conftool/dbconfig/20220416-182606-ladsgroup.json [18:26:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:11] T298565: Fix mismatching field type of user table for columns user_email_authenticated, 
user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [18:26:18] (PS2) Stang: Wikispecies: update logo to prevent being obscured [mediawiki-config] - https://gerrit.wikimedia.org/r/783417 (https://phabricator.wikimedia.org/T306037) [18:30:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T298565)', diff saved to https://phabricator.wikimedia.org/P24684 and previous config saved to /var/cache/conftool/dbconfig/20220416-183032-ladsgroup.json [18:30:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:45:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P24685 and previous config saved to /var/cache/conftool/dbconfig/20220416-184537-ladsgroup.json [18:45:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [18:48:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [18:48:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T298565)', diff saved to https://phabricator.wikimedia.org/P24686 and previous config 
saved to /var/cache/conftool/dbconfig/20220416-190041-ladsgroup.json [19:00:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:47] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [19:00:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P24687 and previous config saved to /var/cache/conftool/dbconfig/20220416-190049-ladsgroup.json [19:00:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:15:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P24688 and previous config saved to /var/cache/conftool/dbconfig/20220416-191546-ladsgroup.json [19:15:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T298565)', diff saved to https://phabricator.wikimedia.org/P24689 and previous config saved to /var/cache/conftool/dbconfig/20220416-191554-ladsgroup.json [19:15:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [19:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on 
db1096.eqiad.wmnet with reason: Maintenance [19:15:58] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [19:16:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3316 (T298565)', diff saved to https://phabricator.wikimedia.org/P24690 and previous config saved to /var/cache/conftool/dbconfig/20220416-191602-ladsgroup.json [19:16:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T298565)', diff saved to https://phabricator.wikimedia.org/P24691 and previous config saved to /var/cache/conftool/dbconfig/20220416-192800-ladsgroup.json [19:28:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:05] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [19:30:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P24692 and previous config saved to /var/cache/conftool/dbconfig/20220416-193052-ladsgroup.json [19:30:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:54] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 51.8 le 60 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [19:34:00] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 20.3 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [19:37:24] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 100.8 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [19:38:32] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [19:38:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [19:38:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [19:38:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T298565)', diff saved to https://phabricator.wikimedia.org/P24693 and previous config saved to /var/cache/conftool/dbconfig/20220416-193901-ladsgroup.json [19:39:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:05] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, 
user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [19:43:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P24694 and previous config saved to /var/cache/conftool/dbconfig/20220416-194305-ladsgroup.json [19:43:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T298565)', diff saved to https://phabricator.wikimedia.org/P24695 and previous config saved to /var/cache/conftool/dbconfig/20220416-194557-ladsgroup.json [19:46:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:02] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [19:46:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2110.codfw.wmnet with reason: Maintenance [19:46:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2110.codfw.wmnet with reason: Maintenance [19:46:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 12 hosts with reason: Maintenance [19:46:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 12 hosts with reason: Maintenance [19:46:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:14] !log ladsgroup@cumin1001 START - 
Cookbook sre.hosts.downtime for 6:00:00 on db1121.eqiad.wmnet with reason: Maintenance [19:56:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1121.eqiad.wmnet with reason: Maintenance [19:56:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [19:56:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [19:56:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T298565)', diff saved to https://phabricator.wikimedia.org/P24696 and previous config saved to /var/cache/conftool/dbconfig/20220416-195623-ladsgroup.json [19:56:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:27] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [19:58:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P24697 and previous config saved to /var/cache/conftool/dbconfig/20220416-195810-ladsgroup.json [19:58:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:30] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:07:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T298565)', diff saved to https://phabricator.wikimedia.org/P24698 and previous config saved to /var/cache/conftool/dbconfig/20220416-200711-ladsgroup.json [20:07:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:19] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [20:11:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P24699 and previous config saved to /var/cache/conftool/dbconfig/20220416-201128-ladsgroup.json [20:11:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T298565)', diff saved to https://phabricator.wikimedia.org/P24700 and previous config saved to /var/cache/conftool/dbconfig/20220416-201315-ladsgroup.json [20:13:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [20:13:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [20:13:20] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [20:13:22] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T298565)', diff saved to https://phabricator.wikimedia.org/P24701 and previous config saved to /var/cache/conftool/dbconfig/20220416-201323-ladsgroup.json [20:13:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:43] (03CR) 10MarcoAurelio: "This change is ready for review." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/780636 (https://phabricator.wikimedia.org/T305782) (owner: 10MarcoAurelio) [20:22:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P24702 and previous config saved to /var/cache/conftool/dbconfig/20220416-202217-ladsgroup.json [20:22:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T298565)', diff saved to https://phabricator.wikimedia.org/P24703 and previous config saved to /var/cache/conftool/dbconfig/20220416-202521-ladsgroup.json [20:25:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:26] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [20:26:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P24704 and previous config saved to /var/cache/conftool/dbconfig/20220416-202633-ladsgroup.json [20:26:36] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:13] (KubernetesRsyslogDown) firing: rsyslog on kubernetes1018:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:37:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P24705 and previous config saved to /var/cache/conftool/dbconfig/20220416-203722-ladsgroup.json [20:37:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P24706 and previous config saved to /var/cache/conftool/dbconfig/20220416-204026-ladsgroup.json [20:40:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T298565)', diff saved to https://phabricator.wikimedia.org/P24707 and previous config saved to /var/cache/conftool/dbconfig/20220416-204138-ladsgroup.json [20:41:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance [20:41:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance [20:41:44] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [20:41:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:47] !log ladsgroup@cumin1001 
dbctl commit (dc=all): 'Depooling db1157 (T298565)', diff saved to https://phabricator.wikimedia.org/P24708 and previous config saved to /var/cache/conftool/dbconfig/20220416-204147-ladsgroup.json [20:41:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T298565)', diff saved to https://phabricator.wikimedia.org/P24709 and previous config saved to /var/cache/conftool/dbconfig/20220416-205227-ladsgroup.json [20:52:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1141.eqiad.wmnet with reason: Maintenance [20:52:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1141.eqiad.wmnet with reason: Maintenance [20:52:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:33] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [20:52:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1141 (T298565)', diff saved to https://phabricator.wikimedia.org/P24710 and previous config saved to /var/cache/conftool/dbconfig/20220416-205234-ladsgroup.json [20:52:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P24711 and previous config saved to 
/var/cache/conftool/dbconfig/20220416-205531-ladsgroup.json [20:55:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T298565)', diff saved to https://phabricator.wikimedia.org/P24712 and previous config saved to /var/cache/conftool/dbconfig/20220416-205906-ladsgroup.json [20:59:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:10] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [21:04:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T298565)', diff saved to https://phabricator.wikimedia.org/P24713 and previous config saved to /var/cache/conftool/dbconfig/20220416-210403-ladsgroup.json [21:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:40] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:10:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T298565)', diff saved to https://phabricator.wikimedia.org/P24714 and previous config saved to /var/cache/conftool/dbconfig/20220416-211037-ladsgroup.json [21:10:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance [21:10:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance [21:10:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:43] T298565: Fix mismatching field type of user table for columns 
user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [21:10:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T298565)', diff saved to https://phabricator.wikimedia.org/P24715 and previous config saved to /var/cache/conftool/dbconfig/20220416-211044-ladsgroup.json [21:10:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P24716 and previous config saved to /var/cache/conftool/dbconfig/20220416-211411-ladsgroup.json [21:14:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T298565)', diff saved to https://phabricator.wikimedia.org/P24717 and previous config saved to /var/cache/conftool/dbconfig/20220416-211506-ladsgroup.json [21:15:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:28] PROBLEM - Check systemd state on ms-fe2009 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:19:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P24718 and previous config saved to /var/cache/conftool/dbconfig/20220416-211908-ladsgroup.json [21:19:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling 
after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P24719 and previous config saved to /var/cache/conftool/dbconfig/20220416-212916-ladsgroup.json [21:29:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P24720 and previous config saved to /var/cache/conftool/dbconfig/20220416-213011-ladsgroup.json [21:30:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P24721 and previous config saved to /var/cache/conftool/dbconfig/20220416-213413-ladsgroup.json [21:34:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T298565)', diff saved to https://phabricator.wikimedia.org/P24722 and previous config saved to /var/cache/conftool/dbconfig/20220416-214421-ladsgroup.json [21:44:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [21:44:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [21:44:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:27] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [21:44:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 
(T298565)', diff saved to https://phabricator.wikimedia.org/P24723 and previous config saved to /var/cache/conftool/dbconfig/20220416-214429-ladsgroup.json [21:44:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P24724 and previous config saved to /var/cache/conftool/dbconfig/20220416-214516-ladsgroup.json [21:45:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:54] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [21:49:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T298565)', diff saved to https://phabricator.wikimedia.org/P24725 and previous config saved to /var/cache/conftool/dbconfig/20220416-214918-ladsgroup.json [21:49:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1142.eqiad.wmnet with reason: Maintenance [21:49:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1142.eqiad.wmnet with reason: Maintenance [21:49:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1142 (T298565)', diff saved to https://phabricator.wikimedia.org/P24726 and previous config saved to /var/cache/conftool/dbconfig/20220416-214926-ladsgroup.json [21:49:29] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:31] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [22:00:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T298565)', diff saved to https://phabricator.wikimedia.org/P24727 and previous config saved to /var/cache/conftool/dbconfig/20220416-220021-ladsgroup.json [22:00:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance [22:00:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance [22:00:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [22:00:27] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [22:00:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [22:00:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T298565)', diff 
saved to https://phabricator.wikimedia.org/P24728 and previous config saved to /var/cache/conftool/dbconfig/20220416-220034-ladsgroup.json [22:00:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T298565)', diff saved to https://phabricator.wikimedia.org/P24729 and previous config saved to /var/cache/conftool/dbconfig/20220416-220055-ladsgroup.json [22:00:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T298565)', diff saved to https://phabricator.wikimedia.org/P24730 and previous config saved to /var/cache/conftool/dbconfig/20220416-220453-ladsgroup.json [22:04:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:11:48] RECOVERY - Check systemd state on ms-fe2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:16:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P24731 and previous config saved to /var/cache/conftool/dbconfig/20220416-221600-ladsgroup.json [22:16:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P24732 and previous config saved to /var/cache/conftool/dbconfig/20220416-221601-ladsgroup.json [22:16:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P24733 and previous config saved to 
/var/cache/conftool/dbconfig/20220416-221958-ladsgroup.json [22:20:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P24734 and previous config saved to /var/cache/conftool/dbconfig/20220416-223105-ladsgroup.json [22:31:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:35:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P24735 and previous config saved to /var/cache/conftool/dbconfig/20220416-223504-ladsgroup.json [22:35:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T298565)', diff saved to https://phabricator.wikimedia.org/P24736 and previous config saved to /var/cache/conftool/dbconfig/20220416-224610-ladsgroup.json [22:46:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T298565)', diff saved to https://phabricator.wikimedia.org/P24737 and previous config saved to /var/cache/conftool/dbconfig/20220416-224610-ladsgroup.json [22:46:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1143.eqiad.wmnet with reason: Maintenance [22:46:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [22:46:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1143.eqiad.wmnet with reason: 
Maintenance [22:46:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [22:46:14] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [22:46:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1143 (T298565)', diff saved to https://phabricator.wikimedia.org/P24738 and previous config saved to /var/cache/conftool/dbconfig/20220416-224617-ladsgroup.json [22:46:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T298565)', diff saved to https://phabricator.wikimedia.org/P24739 and previous config saved to /var/cache/conftool/dbconfig/20220416-224618-ladsgroup.json [22:46:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T298565)', diff saved to https://phabricator.wikimedia.org/P24740 and previous config saved to /var/cache/conftool/dbconfig/20220416-225009-ladsgroup.json [22:50:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1180.eqiad.wmnet with reason: 
Maintenance [22:50:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [22:50:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T298565)', diff saved to https://phabricator.wikimedia.org/P24741 and previous config saved to /var/cache/conftool/dbconfig/20220416-225017-ladsgroup.json [22:50:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:57:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T298565)', diff saved to https://phabricator.wikimedia.org/P24742 and previous config saved to /var/cache/conftool/dbconfig/20220416-225744-ladsgroup.json [22:57:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:57:49] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [23:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [23:04:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T298565)', diff saved to https://phabricator.wikimedia.org/P24743 and previous config saved to /var/cache/conftool/dbconfig/20220416-230441-ladsgroup.json [23:04:45] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:04:46] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [23:12:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P24744 and previous config saved to /var/cache/conftool/dbconfig/20220416-231249-ladsgroup.json [23:12:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P24745 and previous config saved to /var/cache/conftool/dbconfig/20220416-231946-ladsgroup.json [23:19:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:27:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P24746 and previous config saved to /var/cache/conftool/dbconfig/20220416-232754-ladsgroup.json [23:27:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:34:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P24747 and previous config saved to /var/cache/conftool/dbconfig/20220416-233451-ladsgroup.json [23:34:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:43:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T298565)', diff saved to https://phabricator.wikimedia.org/P24748 and previous config saved to /var/cache/conftool/dbconfig/20220416-234259-ladsgroup.json [23:43:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 
6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [23:43:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:43:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [23:43:03] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [23:43:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:43:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:43:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3314 (T298565)', diff saved to https://phabricator.wikimedia.org/P24749 and previous config saved to /var/cache/conftool/dbconfig/20220416-234307-ladsgroup.json [23:43:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:49:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T298565)', diff saved to https://phabricator.wikimedia.org/P24750 and previous config saved to /var/cache/conftool/dbconfig/20220416-234956-ladsgroup.json [23:50:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:50:02] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [23:50:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T298565)', diff saved to https://phabricator.wikimedia.org/P24751 and previous config saved to /var/cache/conftool/dbconfig/20220416-235031-ladsgroup.json 
[23:50:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:54:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T298565)', diff saved to https://phabricator.wikimedia.org/P24752 and previous config saved to /var/cache/conftool/dbconfig/20220416-235449-ladsgroup.json [23:54:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log