[00:02:19] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:06:16] (PS1) Andrew Bogott: Add a properly private encryption key to heat.conf [puppet] - https://gerrit.wikimedia.org/r/800824 (https://phabricator.wikimedia.org/T309407) [00:08:51] (PS1) Andrew Bogott: Add fake auth encryption keys for openstack heat [labs/private] - https://gerrit.wikimedia.org/r/800825 (https://phabricator.wikimedia.org/T309407) [00:08:55] (PS1) Andrew Bogott: Move eqiad1 heat fakes to the right dir [labs/private] - https://gerrit.wikimedia.org/r/800826 [00:13:24] (CR) Andrew Bogott: [V: +2 C: +2] Add fake auth encryption keys for openstack heat [labs/private] - https://gerrit.wikimedia.org/r/800825 (https://phabricator.wikimedia.org/T309407) (owner: Andrew Bogott) [00:13:34] (CR) Andrew Bogott: [V: +2 C: +2] Move eqiad1 heat fakes to the right dir [labs/private] - https://gerrit.wikimedia.org/r/800826 (owner: Andrew Bogott) [00:13:59] PROBLEM - Check systemd state on gitlab1001 is CRITICAL: CRITICAL - degraded: The following units failed: full-backup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:14:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P28729 and previous config saved to /var/cache/conftool/dbconfig/20220528-001437-ladsgroup.json [00:14:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:14:59] (CR) Andrew Bogott: [C: +2] Add a properly private encryption key to heat.conf [puppet] - https://gerrit.wikimedia.org/r/800824 (https://phabricator.wikimedia.org/T309407) (owner: Andrew Bogott) [00:21:03] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook 
[00:27:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1136.eqiad.wmnet with reason: Maintenance [00:27:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1136.eqiad.wmnet with reason: Maintenance [00:28:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:28:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1136 (T298560)', diff saved to https://phabricator.wikimedia.org/P28730 and previous config saved to /var/cache/conftool/dbconfig/20220528-002804-ladsgroup.json [00:28:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:28:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:28:13] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [00:29:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T309311)', diff saved to https://phabricator.wikimedia.org/P28731 and previous config saved to /var/cache/conftool/dbconfig/20220528-002942-ladsgroup.json [00:29:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [00:29:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [00:29:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:29:49] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [00:29:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T309311)', diff saved to https://phabricator.wikimedia.org/P28732 and previous config saved to /var/cache/conftool/dbconfig/20220528-002950-ladsgroup.json [00:29:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:29:56] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:30:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:31:57] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [00:46:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T309311)', diff saved to https://phabricator.wikimedia.org/P28733 and previous config saved to /var/cache/conftool/dbconfig/20220528-004649-ladsgroup.json [00:46:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:46:56] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [00:48:55] PROBLEM - Swift https backend on ms-fe1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [00:50:59] RECOVERY - Swift https backend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.017 second response time https://wikitech.wikimedia.org/wiki/Swift [01:01:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P28734 and previous config saved to /var/cache/conftool/dbconfig/20220528-010154-ladsgroup.json [01:01:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:03:31] PROBLEM - Check systemd state on gitlab1001 is CRITICAL: CRITICAL - degraded: The following units failed: full-backup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:03:33] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:08:01] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={GET,LIST,PATCH,PUT,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes 
https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [01:12:29] RECOVERY - k8s API server requests latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [01:16:57] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [01:16:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P28735 and previous config saved to /var/cache/conftool/dbconfig/20220528-011659-ladsgroup.json [01:17:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:32:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T309311)', diff saved to https://phabricator.wikimedia.org/P28736 and previous config saved to /var/cache/conftool/dbconfig/20220528-013204-ladsgroup.json [01:32:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1122.eqiad.wmnet with reason: Maintenance [01:32:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1122.eqiad.wmnet with reason: Maintenance [01:32:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:32:11] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [01:32:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1122 (T309311)', diff saved to https://phabricator.wikimedia.org/P28737 and previous config saved to /var/cache/conftool/dbconfig/20220528-013212-ladsgroup.json [01:32:15] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:32:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:32:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:32:28] (CR) Aaron Schulz: [C: +1] Switch wgMainStash to db-mainstash [mediawiki-config] - https://gerrit.wikimedia.org/r/799433 (owner: Tim Starling) [01:38:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:43:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:49:47] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:50:39] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [02:05:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T309311)', diff saved to https://phabricator.wikimedia.org/P28738 and previous config saved to /var/cache/conftool/dbconfig/20220528-020512-ladsgroup.json [02:05:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:05:19] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [02:20:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P28739 and previous config saved to 
/var/cache/conftool/dbconfig/20220528-022017-ladsgroup.json [02:20:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:20:25] PROBLEM - SSH on labweb1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:24:23] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [02:35:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P28740 and previous config saved to /var/cache/conftool/dbconfig/20220528-023522-ladsgroup.json [02:35:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:47:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T298560)', diff saved to https://phabricator.wikimedia.org/P28741 and previous config saved to /var/cache/conftool/dbconfig/20220528-024745-ladsgroup.json [02:47:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:47:53] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [02:50:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T309311)', diff saved to https://phabricator.wikimedia.org/P28742 and previous config saved to /var/cache/conftool/dbconfig/20220528-025027-ladsgroup.json [02:50:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [02:50:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on 
db1156.eqiad.wmnet with reason: Maintenance [02:50:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [02:50:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:50:34] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [02:50:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [02:50:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:50:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:50:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T309311)', diff saved to https://phabricator.wikimedia.org/P28743 and previous config saved to /var/cache/conftool/dbconfig/20220528-025040-ladsgroup.json [02:50:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:50:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:50:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:51:01] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:02:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P28744 and previous config saved to /var/cache/conftool/dbconfig/20220528-030250-ladsgroup.json [03:02:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:11:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T309311)', diff saved to https://phabricator.wikimedia.org/P28745 and previous config saved to 
/var/cache/conftool/dbconfig/20220528-031059-ladsgroup.json [03:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:11:07] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [03:17:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P28746 and previous config saved to /var/cache/conftool/dbconfig/20220528-031755-ladsgroup.json [03:18:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:21:33] RECOVERY - SSH on labweb1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:24:53] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:26:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P28747 and previous config saved to /var/cache/conftool/dbconfig/20220528-032604-ladsgroup.json [03:26:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:31:53] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [03:33:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T298560)', diff saved to https://phabricator.wikimedia.org/P28748 and previous config saved to /var/cache/conftool/dbconfig/20220528-033300-ladsgroup.json [03:33:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [03:33:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [03:33:06] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:33:08] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [03:33:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T298560)', diff saved to https://phabricator.wikimedia.org/P28749 and previous config saved to /var/cache/conftool/dbconfig/20220528-033309-ladsgroup.json [03:33:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:33:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:33:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:41:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P28750 and previous config saved to /var/cache/conftool/dbconfig/20220528-034109-ladsgroup.json [03:41:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:42:07] (PS1) KartikMistry: testwiki: Enable Section Translation in 10 Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/800833 (https://phabricator.wikimedia.org/T308829) [03:43:36] (PS1) Andrew Bogott: Add fake service user passwords for 'heat' [labs/private] - https://gerrit.wikimedia.org/r/800834 [03:53:25] (PS2) Andrew Bogott: Add fake service user passwords for 'heat' [labs/private] - https://gerrit.wikimedia.org/r/800834 [03:56:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T309311)', diff saved to https://phabricator.wikimedia.org/P28751 and previous config saved to /var/cache/conftool/dbconfig/20220528-035614-ladsgroup.json [03:56:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [03:56:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: 
Maintenance [03:56:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:56:25] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [03:56:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:56:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:57:53] (CR) Andrew Bogott: [V: +2 C: +2] Add fake service user passwords for 'heat' [labs/private] - https://gerrit.wikimedia.org/r/800834 (owner: Andrew Bogott) [03:59:12] (PS1) Andrew Bogott: Heat: reduce number of workers to 3 per host [puppet] - https://gerrit.wikimedia.org/r/800835 (https://phabricator.wikimedia.org/T309407) [03:59:14] (PS1) Andrew Bogott: Heat: use 'heat' service user instead of novaadmin for auth [puppet] - https://gerrit.wikimedia.org/r/800836 (https://phabricator.wikimedia.org/T309407) [04:00:08] (CR) CI reject: [V: -1] Heat: use 'heat' service user instead of novaadmin for auth [puppet] - https://gerrit.wikimedia.org/r/800836 (https://phabricator.wikimedia.org/T309407) (owner: Andrew Bogott) [04:05:37] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [04:07:17] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:07:36] (CR) Andrew Bogott: [C: +2] Heat: reduce number of workers to 3 per host [puppet] - https://gerrit.wikimedia.org/r/800835 (https://phabricator.wikimedia.org/T309407) (owner: Andrew Bogott) [04:09:06] (PS2) Andrew Bogott: Heat: use 'heat' service user instead of novaadmin for 
auth [puppet] - https://gerrit.wikimedia.org/r/800836 (https://phabricator.wikimedia.org/T309407) [04:13:20] (CR) Andrew Bogott: [C: +2] Heat: use 'heat' service user instead of novaadmin for auth [puppet] - https://gerrit.wikimedia.org/r/800836 (https://phabricator.wikimedia.org/T309407) (owner: Andrew Bogott) [04:14:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [04:14:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [04:14:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:14:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:26:05] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:34:07] PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:35:37] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [06:09:23] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [06:31:47] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [06:35:15] RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220528T0700) [07:03:25] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:03:29] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:05:29] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [07:05:39] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:36:33] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:06:41] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:24:07] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [08:53:05] PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:11:11] 
RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:11:19] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:20:17] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [09:53:57] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [10:11:29] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:11:41] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:22:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T298560)', diff saved to https://phabricator.wikimedia.org/P28752 and previous config saved to /var/cache/conftool/dbconfig/20220528-102212-ladsgroup.json [10:22:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:20] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [10:26:59] RECOVERY - Router interfaces on cr1-eqiad is OK: 
OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:27:13] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:27:33] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [10:37:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P28753 and previous config saved to /var/cache/conftool/dbconfig/20220528-103718-ladsgroup.json [10:37:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:55] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [10:52:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P28754 and previous config saved to /var/cache/conftool/dbconfig/20220528-105223-ladsgroup.json [10:52:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:21] PROBLEM - Host db2088 is DOWN: PING CRITICAL - Packet loss = 100% [11:07:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T298560)', diff saved to https://phabricator.wikimedia.org/P28755 and previous config saved to /var/cache/conftool/dbconfig/20220528-110728-ladsgroup.json [11:07:30] !log ladsgroup@cumin1001 START - Cookbook 
sre.hosts.downtime for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance [11:07:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance [11:07:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [11:07:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:36] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [11:07:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [11:07:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T298560)', diff saved to https://phabricator.wikimedia.org/P28756 and previous config saved to /var/cache/conftool/dbconfig/20220528-110741-ladsgroup.json [11:07:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:29] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:15:47] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:23:31] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, 
WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:31:27] PROBLEM - SSH on labweb1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:51:03] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:52:57] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:56:33] RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:57:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [11:57:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [11:57:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T309311)', diff saved to https://phabricator.wikimedia.org/P28757 and previous config saved to /var/cache/conftool/dbconfig/20220528-115726-ladsgroup.json [11:57:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:35] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [11:58:13] !log ladsgroup@cumin1001 START - Cookbook 
sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [11:58:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [11:58:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [11:59:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [11:59:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1168.eqiad.wmnet with reason: Maintenance [12:02:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1168.eqiad.wmnet with reason: Maintenance [12:02:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T298560)', diff saved to https://phabricator.wikimedia.org/P28758 and previous config saved to /var/cache/conftool/dbconfig/20220528-120211-ladsgroup.json [12:02:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:21] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [12:12:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [12:12:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [12:12:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T309311)', diff saved to https://phabricator.wikimedia.org/P28759 and previous config saved to /var/cache/conftool/dbconfig/20220528-121733-ladsgroup.json [12:17:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:41] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [12:24:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [12:24:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [12:24:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [12:24:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [12:24:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T60674)', diff saved to https://phabricator.wikimedia.org/P28760 and previous config saved to /var/cache/conftool/dbconfig/20220528-122441-ladsgroup.json [12:24:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:54] T60674: Drop 
page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [12:32:33] RECOVERY - SSH on labweb1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:32:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P28761 and previous config saved to /var/cache/conftool/dbconfig/20220528-123238-ladsgroup.json [12:32:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:29] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:42:37] PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:47:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P28762 and previous config saved to /var/cache/conftool/dbconfig/20220528-124743-ladsgroup.json [12:47:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T309311)', diff saved to https://phabricator.wikimedia.org/P28763 and previous config saved to /var/cache/conftool/dbconfig/20220528-130248-ladsgroup.json [13:02:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance [13:02:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance [13:02:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:56] T309311: Make user_editcount unsigned in production - 
https://phabricator.wikimedia.org/T309311 [13:02:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T309311)', diff saved to https://phabricator.wikimedia.org/P28764 and previous config saved to /var/cache/conftool/dbconfig/20220528-130256-ladsgroup.json [13:03:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T60674)', diff saved to https://phabricator.wikimedia.org/P28765 and previous config saved to /var/cache/conftool/dbconfig/20220528-130742-ladsgroup.json [13:07:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:48] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [13:08:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [13:08:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [13:08:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T309311)', diff saved to https://phabricator.wikimedia.org/P28766 and previous config saved to /var/cache/conftool/dbconfig/20220528-130838-ladsgroup.json [13:08:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:47] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [13:10:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T309311)', diff saved to 
https://phabricator.wikimedia.org/P28767 and previous config saved to /var/cache/conftool/dbconfig/20220528-131056-ladsgroup.json [13:11:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:51] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:15:39] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [13:22:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P28768 and previous config saved to /var/cache/conftool/dbconfig/20220528-132247-ladsgroup.json [13:22:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T298560)', diff saved to https://phabricator.wikimedia.org/P28769 and previous config saved to /var/cache/conftool/dbconfig/20220528-132558-ladsgroup.json [13:26:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:05] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [13:26:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P28770 and previous config saved to /var/cache/conftool/dbconfig/20220528-132607-ladsgroup.json [13:26:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T309311)', diff saved to https://phabricator.wikimedia.org/P28771 and previous config saved to /var/cache/conftool/dbconfig/20220528-133609-ladsgroup.json [13:36:15] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:17] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [13:37:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P28772 and previous config saved to /var/cache/conftool/dbconfig/20220528-133752-ladsgroup.json [13:37:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P28773 and previous config saved to /var/cache/conftool/dbconfig/20220528-134103-ladsgroup.json [13:41:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P28774 and previous config saved to /var/cache/conftool/dbconfig/20220528-134113-ladsgroup.json [13:41:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:49] RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:49:21] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [13:49:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T298560)', diff saved to https://phabricator.wikimedia.org/P28775 and previous config saved to /var/cache/conftool/dbconfig/20220528-134951-ladsgroup.json [13:50:01] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:02] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [13:51:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P28776 and previous config saved to /var/cache/conftool/dbconfig/20220528-135114-ladsgroup.json [13:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T60674)', diff saved to https://phabricator.wikimedia.org/P28777 and previous config saved to /var/cache/conftool/dbconfig/20220528-135257-ladsgroup.json [13:52:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1122.eqiad.wmnet with reason: Maintenance [13:53:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1122.eqiad.wmnet with reason: Maintenance [13:53:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:04] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [13:53:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1122 (T60674)', diff saved to https://phabricator.wikimedia.org/P28778 and previous config saved to /var/cache/conftool/dbconfig/20220528-135305-ladsgroup.json [13:53:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P28779 and previous config saved to /var/cache/conftool/dbconfig/20220528-135608-ladsgroup.json [13:56:13] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T309311)', diff saved to https://phabricator.wikimedia.org/P28780 and previous config saved to /var/cache/conftool/dbconfig/20220528-135618-ladsgroup.json [13:56:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [13:56:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [13:56:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:24] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [13:56:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [14:02:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [14:02:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T309311)', diff saved to https://phabricator.wikimedia.org/P28781 and previous config saved to /var/cache/conftool/dbconfig/20220528-140212-ladsgroup.json [14:02:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:20] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [14:04:21] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:04:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P28782 and previous config saved to /var/cache/conftool/dbconfig/20220528-140457-ladsgroup.json [14:05:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P28783 and previous config saved to /var/cache/conftool/dbconfig/20220528-140619-ladsgroup.json [14:06:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T60674)', diff saved to https://phabricator.wikimedia.org/P28784 and previous config saved to /var/cache/conftool/dbconfig/20220528-140758-ladsgroup.json [14:08:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:06] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [14:10:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T309311)', diff saved to https://phabricator.wikimedia.org/P28785 and previous config saved to /var/cache/conftool/dbconfig/20220528-141028-ladsgroup.json [14:10:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:35] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [14:11:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T298560)', diff saved to https://phabricator.wikimedia.org/P28786 and previous config saved to /var/cache/conftool/dbconfig/20220528-141113-ladsgroup.json [14:11:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 
db1174.eqiad.wmnet with reason: Maintenance [14:11:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance [14:11:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:20] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [14:11:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T298560)', diff saved to https://phabricator.wikimedia.org/P28787 and previous config saved to /var/cache/conftool/dbconfig/20220528-141121-ladsgroup.json [14:11:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P28788 and previous config saved to /var/cache/conftool/dbconfig/20220528-142002-ladsgroup.json [14:20:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T309311)', diff saved to https://phabricator.wikimedia.org/P28789 and previous config saved to /var/cache/conftool/dbconfig/20220528-142124-ladsgroup.json [14:21:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [14:21:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [14:21:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:30] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [14:21:33] !log ladsgroup@cumin1001 
dbctl commit (dc=all): 'Depooling db1175 (T309311)', diff saved to https://phabricator.wikimedia.org/P28790 and previous config saved to /var/cache/conftool/dbconfig/20220528-142132-ladsgroup.json [14:21:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P28791 and previous config saved to /var/cache/conftool/dbconfig/20220528-142303-ladsgroup.json [14:23:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P28792 and previous config saved to /var/cache/conftool/dbconfig/20220528-142533-ladsgroup.json [14:25:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T298560)', diff saved to https://phabricator.wikimedia.org/P28793 and previous config saved to /var/cache/conftool/dbconfig/20220528-143507-ladsgroup.json [14:35:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance [14:35:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance [14:35:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:14] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [14:35:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T298560)', diff saved to 
https://phabricator.wikimedia.org/P28794 and previous config saved to /var/cache/conftool/dbconfig/20220528-143515-ladsgroup.json [14:35:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P28795 and previous config saved to /var/cache/conftool/dbconfig/20220528-143808-ladsgroup.json [14:38:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:21] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:40:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P28796 and previous config saved to /var/cache/conftool/dbconfig/20220528-144038-ladsgroup.json [14:40:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T309311)', diff saved to https://phabricator.wikimedia.org/P28797 and previous config saved to /var/cache/conftool/dbconfig/20220528-144704-ladsgroup.json [14:47:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:11] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [14:53:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T60674)', diff saved to https://phabricator.wikimedia.org/P28798 and previous config saved to /var/cache/conftool/dbconfig/20220528-145313-ladsgroup.json [14:53:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 
6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [14:53:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [14:53:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:21] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [14:53:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T60674)', diff saved to https://phabricator.wikimedia.org/P28799 and previous config saved to /var/cache/conftool/dbconfig/20220528-145321-ladsgroup.json [14:53:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T60674)', diff saved to https://phabricator.wikimedia.org/P28800 and previous config saved to /var/cache/conftool/dbconfig/20220528-145532-ladsgroup.json [14:55:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T309311)', diff saved to https://phabricator.wikimedia.org/P28801 and previous config saved to /var/cache/conftool/dbconfig/20220528-145544-ladsgroup.json [14:55:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [14:55:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [14:55:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [14:55:49] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:50] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [14:55:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [14:55:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T309311)', diff saved to https://phabricator.wikimedia.org/P28802 and previous config saved to /var/cache/conftool/dbconfig/20220528-145557-ladsgroup.json [14:55:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P28803 and previous config saved to /var/cache/conftool/dbconfig/20220528-150209-ladsgroup.json [15:02:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T309311)', diff saved to https://phabricator.wikimedia.org/P28804 and previous config saved to /var/cache/conftool/dbconfig/20220528-150348-ladsgroup.json [15:03:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:56] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [15:10:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P28805 and previous config saved to 
/var/cache/conftool/dbconfig/20220528-151037-ladsgroup.json [15:10:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P28806 and previous config saved to /var/cache/conftool/dbconfig/20220528-151714-ladsgroup.json [15:17:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P28807 and previous config saved to /var/cache/conftool/dbconfig/20220528-151854-ladsgroup.json [15:18:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P28808 and previous config saved to /var/cache/conftool/dbconfig/20220528-152542-ladsgroup.json [15:25:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T309311)', diff saved to https://phabricator.wikimedia.org/P28809 and previous config saved to /var/cache/conftool/dbconfig/20220528-153219-ladsgroup.json [15:32:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [15:32:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [15:32:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 6 hosts with reason: Maintenance [15:32:27] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [15:32:31] !log 
ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 6 hosts with reason: Maintenance [15:32:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P28810 and previous config saved to /var/cache/conftool/dbconfig/20220528-153359-ladsgroup.json [15:34:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:31] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:40:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T60674)', diff saved to https://phabricator.wikimedia.org/P28811 and previous config saved to /var/cache/conftool/dbconfig/20220528-154047-ladsgroup.json [15:40:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [15:40:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [15:40:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:55] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [15:40:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T60674)', diff saved to https://phabricator.wikimedia.org/P28812 and previous config saved to /var/cache/conftool/dbconfig/20220528-154055-ladsgroup.json [15:40:58] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T309311)', diff saved to https://phabricator.wikimedia.org/P28813 and previous config saved to /var/cache/conftool/dbconfig/20220528-154904-ladsgroup.json [15:49:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance [15:49:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance [15:49:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [15:49:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:15] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [15:49:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [15:49:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [15:53:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [15:53:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:07] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T60674)', diff saved to https://phabricator.wikimedia.org/P28814 and previous config saved to /var/cache/conftool/dbconfig/20220528-155436-ladsgroup.json [15:54:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:43] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [15:58:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [15:58:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [15:58:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3315 (T309311)', diff saved to https://phabricator.wikimedia.org/P28815 and previous config saved to /var/cache/conftool/dbconfig/20220528-155858-ladsgroup.json [15:59:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:06] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [16:07:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T309311)', diff saved to https://phabricator.wikimedia.org/P28816 and previous config saved to /var/cache/conftool/dbconfig/20220528-160730-ladsgroup.json [16:07:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:36] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [16:09:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P28817 and 
previous config saved to /var/cache/conftool/dbconfig/20220528-160941-ladsgroup.json [16:09:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:31] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:16:41] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:18:58] (PS2) Zabe: deployment-prep: Drop deployment-restbase03, no longer to be used [puppet] - https://gerrit.wikimedia.org/r/790424 (https://phabricator.wikimedia.org/T306052) (owner: Jforrester) [16:22:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P28818 and previous config saved to /var/cache/conftool/dbconfig/20220528-162235-ladsgroup.json [16:22:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T298560)', diff saved to https://phabricator.wikimedia.org/P28819 and previous config saved to /var/cache/conftool/dbconfig/20220528-162327-ladsgroup.json [16:23:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:34] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [16:24:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P28820 and previous config saved to /var/cache/conftool/dbconfig/20220528-162446-ladsgroup.json [16:24:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T298560)', diff saved to 
https://phabricator.wikimedia.org/P28821 and previous config saved to /var/cache/conftool/dbconfig/20220528-162801-ladsgroup.json [16:28:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P28822 and previous config saved to /var/cache/conftool/dbconfig/20220528-163740-ladsgroup.json [16:37:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P28823 and previous config saved to /var/cache/conftool/dbconfig/20220528-163832-ladsgroup.json [16:38:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T60674)', diff saved to https://phabricator.wikimedia.org/P28824 and previous config saved to /var/cache/conftool/dbconfig/20220528-163951-ladsgroup.json [16:39:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance [16:39:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance [16:39:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 7 hosts with reason: Maintenance [16:39:58] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [16:40:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 7 hosts with reason: Maintenance [16:40:07] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [16:40:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [16:40:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P28825 and previous config saved to /var/cache/conftool/dbconfig/20220528-164306-ladsgroup.json [16:43:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:09] !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host snapshot1008.eqiad.wmnet [16:46:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:39] yes that's me, the reboot needs to happen on a weekend, please ignore [16:49:17] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:51:58] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host snapshot1008.eqiad.wmnet [16:52:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: 
Maintenance [16:52:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [16:52:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T60674)', diff saved to https://phabricator.wikimedia.org/P28826 and previous config saved to /var/cache/conftool/dbconfig/20220528-165238-ladsgroup.json [16:52:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T309311)', diff saved to https://phabricator.wikimedia.org/P28827 and previous config saved to /var/cache/conftool/dbconfig/20220528-165245-ladsgroup.json [16:52:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1130.eqiad.wmnet with reason: Maintenance [16:52:47] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [16:52:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1130.eqiad.wmnet with reason: Maintenance [16:52:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:52] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [16:52:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1130 (T309311)', diff saved to https://phabricator.wikimedia.org/P28828 and previous config saved to /var/cache/conftool/dbconfig/20220528-165253-ladsgroup.json [16:52:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[16:53:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P28829 and previous config saved to /var/cache/conftool/dbconfig/20220528-165337-ladsgroup.json [16:53:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [16:55:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [16:55:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T309311)', diff saved to https://phabricator.wikimedia.org/P28830 and previous config saved to /var/cache/conftool/dbconfig/20220528-165542-ladsgroup.json [16:55:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T309311)', diff saved to https://phabricator.wikimedia.org/P28831 and previous config saved to /var/cache/conftool/dbconfig/20220528-165758-ladsgroup.json [16:58:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:04] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [16:58:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P28832 and previous config saved to /var/cache/conftool/dbconfig/20220528-165811-ladsgroup.json [16:58:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T60674)', diff saved to 
https://phabricator.wikimedia.org/P28833 and previous config saved to /var/cache/conftool/dbconfig/20220528-170747-ladsgroup.json [17:07:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:54] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [17:08:01] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:08:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T298560)', diff saved to https://phabricator.wikimedia.org/P28834 and previous config saved to /var/cache/conftool/dbconfig/20220528-170843-ladsgroup.json [17:08:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance [17:08:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance [17:08:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [17:08:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:49] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [17:08:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [17:08:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T298560)', diff saved to https://phabricator.wikimedia.org/P28835 and previous config saved to /var/cache/conftool/dbconfig/20220528-170856-ladsgroup.json [17:08:58] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130', diff saved to https://phabricator.wikimedia.org/P28836 and previous config saved to /var/cache/conftool/dbconfig/20220528-171303-ladsgroup.json [17:13:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T298560)', diff saved to https://phabricator.wikimedia.org/P28837 and previous config saved to /var/cache/conftool/dbconfig/20220528-171316-ladsgroup.json [17:13:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [17:13:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [17:13:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P28838 and previous config saved to /var/cache/conftool/dbconfig/20220528-172252-ladsgroup.json [17:22:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130', diff saved to https://phabricator.wikimedia.org/P28839 and previous config saved to 
/var/cache/conftool/dbconfig/20220528-172808-ladsgroup.json [17:28:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P28840 and previous config saved to /var/cache/conftool/dbconfig/20220528-173757-ladsgroup.json [17:38:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T309311)', diff saved to https://phabricator.wikimedia.org/P28841 and previous config saved to /var/cache/conftool/dbconfig/20220528-174313-ladsgroup.json [17:43:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:21] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [17:45:27] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:53:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T60674)', diff saved to https://phabricator.wikimedia.org/P28842 and previous config saved to /var/cache/conftool/dbconfig/20220528-175302-ladsgroup.json [17:53:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [17:53:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [17:53:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:09] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [17:53:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T60674)', diff saved to https://phabricator.wikimedia.org/P28843 and previous config 
saved to /var/cache/conftool/dbconfig/20220528-175310-ladsgroup.json [17:53:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T309311)', diff saved to https://phabricator.wikimedia.org/P28844 and previous config saved to /var/cache/conftool/dbconfig/20220528-175556-ladsgroup.json [17:56:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:03] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [18:09:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T60674)', diff saved to https://phabricator.wikimedia.org/P28845 and previous config saved to /var/cache/conftool/dbconfig/20220528-180904-ladsgroup.json [18:09:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:12] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [18:11:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P28846 and previous config saved to /var/cache/conftool/dbconfig/20220528-181101-ladsgroup.json [18:11:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P28847 and previous config saved to /var/cache/conftool/dbconfig/20220528-182409-ladsgroup.json [18:24:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to 
https://phabricator.wikimedia.org/P28848 and previous config saved to /var/cache/conftool/dbconfig/20220528-182606-ladsgroup.json [18:26:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P28849 and previous config saved to /var/cache/conftool/dbconfig/20220528-183914-ladsgroup.json [18:39:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T309311)', diff saved to https://phabricator.wikimedia.org/P28850 and previous config saved to /var/cache/conftool/dbconfig/20220528-184111-ladsgroup.json [18:41:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance [18:41:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance [18:41:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [18:41:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:19] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [18:41:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [18:41:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T309311)', diff saved to https://phabricator.wikimedia.org/P28851 and previous config saved to /var/cache/conftool/dbconfig/20220528-184125-ladsgroup.json [18:41:26] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:23] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:52:51] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [18:54:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T60674)', diff saved to https://phabricator.wikimedia.org/P28852 and previous config saved to /var/cache/conftool/dbconfig/20220528-185420-ladsgroup.json [18:54:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [18:54:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [18:54:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:27] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [18:54:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T60674)', diff saved to https://phabricator.wikimedia.org/P28853 and previous config saved to /var/cache/conftool/dbconfig/20220528-185428-ladsgroup.json [18:54:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:33] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T298560)', diff saved to https://phabricator.wikimedia.org/P28854 and previous config saved to /var/cache/conftool/dbconfig/20220528-185614-ladsgroup.json [18:56:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:20] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [19:05:40] PROBLEM - MariaDB Replica Lag: s3 #page on db1112 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1391.14 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:06:06] what's that? [19:06:36] hello hello [19:06:46] marostegui: depooled earlier [19:06:54] https://sal.toolforge.org/production?p=0&q=db1112&d= [19:07:35] Amir1: please adjust the downtime for that schema change [19:07:38] make it double [19:07:50] wait wat [19:07:53] strange, according to SAL it was downtimed for 6h at 18:41, should still be downtimed [19:07:54] It was 24 hours [19:07:58] https://sal.toolforge.org/log/Kf35C4EBa_6PSCT9FKif [19:08:02] It should be downtimed [19:08:16] icinga shows it downtimed as well [19:08:20] That was 30 minutes ago [19:08:31] https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=1&host=db1112 [19:08:38] It is not downtimed on icinga [19:09:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T60674)', diff saved to https://phabricator.wikimedia.org/P28855 and previous config saved to /var/cache/conftool/dbconfig/20220528-190859-ladsgroup.json [19:09:03] I refreshed it and now it does wtf? 
[19:09:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:08] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [19:09:16] I haven't touched anything [19:09:28] Then maybe some icinga weirdness? [19:09:37] I guess it has to be, can't imagine what though [19:09:50] that's weird, we have been operating the same code for months now, it really can't be the schema change code [19:10:16] the revision change one is the only thing that might take a while and it's 24 hours [19:10:23] the schema change code is definitely innocent, it looks like it called the downtime cookbook correctly [19:10:46] Maybe we need to investigate in icinga [19:10:55] But it is late here to do so, and probably not urgent [19:10:56] and the downtime cookbook looks like it did the right thing on icinga -- but for whatever reason icinga isn't handling the downtime consistently? [19:11:03] So I am going to go back to the evening [19:11:08] yeah -- I'll open a task and we can dig on Monday [19:11:10] I ack'ed the alert [19:11:14] Should I resolve it? 
[19:11:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P28856 and previous config saved to /var/cache/conftool/dbconfig/20220528-191119-ladsgroup.json [19:11:24] marostegui: yeah [19:11:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:27] I think let's resolve to avoid another ping tomorrow [19:11:40] done [19:11:46] 👍 [19:11:59] rzl: if possible add me to the icinga task, so I can lurk [19:12:03] will do [19:12:05] here is a wild idea: Anything that is not pooled should not page [19:12:28] RECOVERY - MariaDB Replica Lag: s3 #page on db1112 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:12:29] Amir1: that's always what we try, but it is a manual process at the moment [19:12:36] Anyways, I am out see you all! [19:12:43] enjoy your weekend! [19:12:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T309311)', diff saved to https://phabricator.wikimedia.org/P28857 and previous config saved to /var/cache/conftool/dbconfig/20220528-191253-ladsgroup.json [19:12:53] <3 [19:12:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:59] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [19:20:36] SRE, Icinga, observability: Icinga paged for a host that should have been downtimed - https://phabricator.wikimedia.org/T309447 (RLazarus) [19:20:39] ^ done [19:20:46] heading offline again 👋 [19:24:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P28858 and previous config saved to /var/cache/conftool/dbconfig/20220528-192404-ladsgroup.json [19:24:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:25] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P28859 and previous config saved to /var/cache/conftool/dbconfig/20220528-192625-ladsgroup.json [19:26:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P28860 and previous config saved to /var/cache/conftool/dbconfig/20220528-192758-ladsgroup.json [19:28:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:15] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:39:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P28861 and previous config saved to /var/cache/conftool/dbconfig/20220528-193909-ladsgroup.json [19:39:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T298560)', diff saved to https://phabricator.wikimedia.org/P28862 and previous config saved to /var/cache/conftool/dbconfig/20220528-194130-ladsgroup.json [19:41:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1096.eqiad.wmnet with reason: Maintenance [19:41:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1096.eqiad.wmnet with reason: Maintenance [19:41:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:37] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [19:41:38] !log ladsgroup@cumin1001 dbctl commit 
(dc=all): 'Depooling db1096:3316 (T298560)', diff saved to https://phabricator.wikimedia.org/P28863 and previous config saved to /var/cache/conftool/dbconfig/20220528-194138-ladsgroup.json [19:41:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P28864 and previous config saved to /var/cache/conftool/dbconfig/20220528-194303-ladsgroup.json [19:43:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:17] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:49:07] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:54:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T60674)', diff saved to https://phabricator.wikimedia.org/P28865 and previous config saved to /var/cache/conftool/dbconfig/20220528-195414-ladsgroup.json [19:54:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:22] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [19:58:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T309311)', diff saved to https://phabricator.wikimedia.org/P28866 and previous config saved to /var/cache/conftool/dbconfig/20220528-195809-ladsgroup.json [19:58:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [19:58:13] !log 
ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [19:58:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:16] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [19:58:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:20] Puppet, Infrastructure-Foundations: Package 'cgroup-bin' has no installation candidate on Debian 11 (modules/mediawiki/manifests/cgroup.pp) - https://phabricator.wikimedia.org/T309449 (TheresNoTime) [20:21:29] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:22:51] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:23:50] (PS1) Samtar: cgroup: Add different package for Bullseye [puppet] - https://gerrit.wikimedia.org/r/800855 (https://phabricator.wikimedia.org/T309449) [20:26:39] (CR) RhinosF1: [C: +1] cgroup: Add different package for Bullseye [puppet] - https://gerrit.wikimedia.org/r/800855 (https://phabricator.wikimedia.org/T309449) (owner: Samtar) [20:42:39] PROBLEM - SSH on labweb1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:56:35] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [21:02:05] PROBLEM - 
IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 93 probes of 672 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [21:05:35] PROBLEM - SSH on druid1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:07:55] PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:08:19] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 65 probes of 672 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [21:08:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance [21:08:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance [21:08:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1157 (T309311)', diff saved to https://phabricator.wikimedia.org/P28867 and previous config saved to /var/cache/conftool/dbconfig/20220528-210837-ladsgroup.json [21:08:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:49] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [21:30:19] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, 
WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [21:34:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T309311)', diff saved to https://phabricator.wikimedia.org/P28868 and previous config saved to /var/cache/conftool/dbconfig/20220528-213419-ladsgroup.json [21:34:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:27] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [21:37:23] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:38:03] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:47:03] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:49:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P28869 and previous config saved to /var/cache/conftool/dbconfig/20220528-214924-ladsgroup.json [21:49:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:49] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [21:52:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1130.eqiad.wmnet with reason: Maintenance [21:52:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on 
db1130.eqiad.wmnet with reason: Maintenance [21:52:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1130 (T307525)', diff saved to https://phabricator.wikimedia.org/P28870 and previous config saved to /var/cache/conftool/dbconfig/20220528-215258-ladsgroup.json [21:52:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:06] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [21:56:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T307525)', diff saved to https://phabricator.wikimedia.org/P28871 and previous config saved to /var/cache/conftool/dbconfig/20220528-215633-ladsgroup.json [21:56:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:16] (PS1) Andrew Bogott: Heat: include internal keystone url in heat config [puppet] - https://gerrit.wikimedia.org/r/800858 (https://phabricator.wikimedia.org/T309407) [22:04:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P28872 and previous config saved to /var/cache/conftool/dbconfig/20220528-220429-ladsgroup.json [22:04:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:06:47] RECOVERY - SSH on druid1006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:09:10] (CR) Andrew Bogott: [C: +2] Heat: include internal keystone url in heat config [puppet] - https://gerrit.wikimedia.org/r/800858 (https://phabricator.wikimedia.org/T309407) (owner: Andrew Bogott) [22:11:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db1130', diff saved to https://phabricator.wikimedia.org/P28873 and previous config saved to /var/cache/conftool/dbconfig/20220528-221138-ladsgroup.json [22:11:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T309311)', diff saved to https://phabricator.wikimedia.org/P28874 and previous config saved to /var/cache/conftool/dbconfig/20220528-221934-ladsgroup.json [22:19:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [22:19:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [22:19:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:42] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [22:19:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:29] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [22:26:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130', diff saved to https://phabricator.wikimedia.org/P28875 and previous config saved to /var/cache/conftool/dbconfig/20220528-222643-ladsgroup.json [22:26:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:56] (PS1) Andrew Bogott: heat.conf: add [trustee] section [puppet] - https://gerrit.wikimedia.org/r/800861 
(https://phabricator.wikimedia.org/T309407) [22:40:14] (CR) Andrew Bogott: [C: +2] heat.conf: add [trustee] section [puppet] - https://gerrit.wikimedia.org/r/800861 (https://phabricator.wikimedia.org/T309407) (owner: Andrew Bogott) [22:41:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T307525)', diff saved to https://phabricator.wikimedia.org/P28876 and previous config saved to /var/cache/conftool/dbconfig/20220528-224148-ladsgroup.json [22:41:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [22:41:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [22:41:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:41:54] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [22:41:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3315 (T307525)', diff saved to https://phabricator.wikimedia.org/P28877 and previous config saved to /var/cache/conftool/dbconfig/20220528-224156-ladsgroup.json [22:41:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:34] (PS1) Andrew Bogott: heat: limit number of engine workers [puppet] - https://gerrit.wikimedia.org/r/800862 (https://phabricator.wikimedia.org/T309407) [22:48:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T307525)', diff saved to https://phabricator.wikimedia.org/P28878 and previous config saved to /var/cache/conftool/dbconfig/20220528-224821-ladsgroup.json [22:48:26] (CR) Andrew Bogott: [C: +2] 
heat: limit number of engine workers [puppet] - https://gerrit.wikimedia.org/r/800862 (https://phabricator.wikimedia.org/T309407) (owner: Andrew Bogott) [22:48:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:48:28] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [22:55:05] PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:03:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P28879 and previous config saved to /var/cache/conftool/dbconfig/20220528-230326-ladsgroup.json [23:03:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:15:33] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:18:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P28880 and previous config saved to /var/cache/conftool/dbconfig/20220528-231831-ladsgroup.json [23:18:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:33:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T307525)', diff saved to https://phabricator.wikimedia.org/P28881 and previous config saved to /var/cache/conftool/dbconfig/20220528-233336-ladsgroup.json [23:33:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [23:33:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [23:33:42] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:33:44] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [23:33:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:33:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:36:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1101.eqiad.wmnet with reason: Maintenance [23:36:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1101.eqiad.wmnet with reason: Maintenance [23:36:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:36:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T298560)', diff saved to https://phabricator.wikimedia.org/P28882 and previous config saved to /var/cache/conftool/dbconfig/20220528-233650-ladsgroup.json [23:36:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:36:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:37:00] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [23:45:48] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring