[00:04:22] Southparkfan: which vandal?
[00:26:05] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[00:28:23] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:38:11] RECOVERY - SSH on druid1006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:14:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T298560)', diff saved to https://phabricator.wikimedia.org/P28902 and previous config saved to /var/cache/conftool/dbconfig/20220530-011448-ladsgroup.json
[01:14:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:14:57] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[01:23:45] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:29:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P28903 and previous config saved to /var/cache/conftool/dbconfig/20220530-012953-ladsgroup.json
[01:29:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:44:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P28904 and previous config saved to /var/cache/conftool/dbconfig/20220530-014458-ladsgroup.json
[01:45:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:00:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T298560)', diff saved to https://phabricator.wikimedia.org/P28905 and previous config saved to /var/cache/conftool/dbconfig/20220530-020003-ladsgroup.json
[02:00:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1131.eqiad.wmnet with reason: Maintenance
[02:00:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1131.eqiad.wmnet with reason: Maintenance
[02:00:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:00:11] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[02:00:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1131 (T298560)', diff saved to https://phabricator.wikimedia.org/P28906 and previous config saved to /var/cache/conftool/dbconfig/20220530-020011-ladsgroup.json
[02:00:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:00:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:00:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:44:20] (CR) Andrew Bogott: [C: +2] Rough in manifest and files for OpenStack Magnum [puppet] - https://gerrit.wikimedia.org/r/800868 (https://phabricator.wikimedia.org/T280792) (owner: Andrew Bogott)
[02:48:45] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:03:42] (PS1) Andrew Bogott: Magnum: use internal keystone url rather than admin [puppet] - https://gerrit.wikimedia.org/r/801011 (https://phabricator.wikimedia.org/T280792)
[03:05:07] (CR) Andrew Bogott: [C: +2] Magnum: use internal keystone url rather than admin [puppet] - https://gerrit.wikimedia.org/r/801011 (https://phabricator.wikimedia.org/T280792) (owner: Andrew Bogott)
[03:21:33] PROBLEM - SSH on ms-be1063 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[03:23:31] PROBLEM - WDQS SPARQL on wdqs1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:25:35] RECOVERY - WDQS SPARQL on wdqs1005 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.096 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:29:26] (PS1) Andrew Bogott: Magnum: add haproxy in codfw1dev [puppet] - https://gerrit.wikimedia.org/r/801012 (https://phabricator.wikimedia.org/T280792)
[03:29:28] (PS1) Andrew Bogott: Heat: include transport_url for the notification section [puppet] - https://gerrit.wikimedia.org/r/801013 (https://phabricator.wikimedia.org/T280792)
[03:30:11] PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:30:13] RECOVERY - SSH on ms-be1063 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[03:31:32] (CR) Andrew Bogott: [C: +2] Heat: include transport_url for the notification section [puppet] - https://gerrit.wikimedia.org/r/801013 (https://phabricator.wikimedia.org/T280792) (owner: Andrew Bogott)
[03:32:18] (CR) Andrew Bogott: [C: +2] Magnum: add haproxy in codfw1dev [puppet] - https://gerrit.wikimedia.org/r/801012 (https://phabricator.wikimedia.org/T280792) (owner: Andrew Bogott)
[03:39:55] (PS1) Andrew Bogott: Magnum: move api listening port away from the haproxy port [puppet] - https://gerrit.wikimedia.org/r/801014 (https://phabricator.wikimedia.org/T280792)
[03:40:55] (CR) Andrew Bogott: [C: +2] Magnum: move api listening port away from the haproxy port [puppet] - https://gerrit.wikimedia.org/r/801014 (https://phabricator.wikimedia.org/T280792) (owner: Andrew Bogott)
[03:49:47] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:53:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T298560)', diff saved to https://phabricator.wikimedia.org/P28907 and previous config saved to /var/cache/conftool/dbconfig/20220530-035322-ladsgroup.json
[03:53:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:53:30] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[04:08:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P28908 and previous config saved to /var/cache/conftool/dbconfig/20220530-040827-ladsgroup.json
[04:08:31] PROBLEM - SSH on restbase1018.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:08:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:11:25] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:13:29] (PS1) Andrew Bogott: Magnum: limit number of conductor workers [puppet] - https://gerrit.wikimedia.org/r/801016
[04:23:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P28909 and previous config saved to /var/cache/conftool/dbconfig/20220530-042332-ladsgroup.json
[04:23:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:31:27] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:34:55] (PS1) Sharvaniharan: Stream config for android breadcrumbs schema [mediawiki-config] - https://gerrit.wikimedia.org/r/801018
[04:36:42] (CR) Sharvaniharan: "Hi @Ottomata. This is the stream config for the new breadcrumbs schema I just created on secondary repo. please review when you get a chan" [mediawiki-config] - https://gerrit.wikimedia.org/r/801018 (owner: Sharvaniharan)
[04:38:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T298560)', diff saved to https://phabricator.wikimedia.org/P28910 and previous config saved to /var/cache/conftool/dbconfig/20220530-043837-ladsgroup.json
[04:38:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[04:38:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[04:38:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:38:46] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[04:38:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:38:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:45:17] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:09:47] RECOVERY - SSH on restbase1018.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:15:36] (PS1) Marostegui: db1184: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/801019 (https://phabricator.wikimedia.org/T309303)
[05:15:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1184', diff saved to https://phabricator.wikimedia.org/P28911 and previous config saved to /var/cache/conftool/dbconfig/20220530-051555-marostegui.json
[05:16:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:16:26] (CR) Marostegui: [C: +2] db1184: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/801019 (https://phabricator.wikimedia.org/T309303) (owner: Marostegui)
[05:22:39] (PS1) Marostegui: mariadb: Move db1128 from m1 to s1 [puppet] - https://gerrit.wikimedia.org/r/801086 (https://phabricator.wikimedia.org/T309303)
[05:24:25] (CR) Marostegui: [C: +2] mariadb: Move db1128 from m1 to s1 [puppet] - https://gerrit.wikimedia.org/r/801086 (https://phabricator.wikimedia.org/T309303) (owner: Marostegui)
[05:26:23] !log Drop renamed revision_actor_temp on s6 T307906
[05:26:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:26:30] T307906: Drop revision_actor_temp in production - https://phabricator.wikimedia.org/T307906
[05:28:42] !log Drop renamed revision_actor_temp on s2 T307906
[05:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:32:08] (PS1) Marostegui: db2088: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/801171 (https://phabricator.wikimedia.org/T309485)
[05:33:28] (CR) Marostegui: [C: +2] db2088: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/801171 (https://phabricator.wikimedia.org/T309485) (owner: Marostegui)
[05:35:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2088 (s1 and s2) T309485', diff saved to https://phabricator.wikimedia.org/P28913 and previous config saved to /var/cache/conftool/dbconfig/20220530-053459-marostegui.json
[05:35:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:35:07] T309485: db2088 crashed - https://phabricator.wikimedia.org/T309485
[05:37:16] ops-codfw, DBA, Patch-For-Review: db2088 crashed - https://phabricator.wikimedia.org/T309485 (Marostegui) a:Papaul @Papaul db2088's mgmt interface is also unavailable so I cannot check the logs and/or if the host is up and the network failed. Can you check on-site? Thank you!
[06:01:14] !log Drop renamed revision_actor_temp on s7 T307906
[06:01:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:20] T307906: Drop revision_actor_temp in production - https://phabricator.wikimedia.org/T307906
[06:08:57] RECOVERY - SSH on labweb1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:10:20] !log Drop renamed revision_actor_temp on s5 T307906
[06:10:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:10:26] T307906: Drop revision_actor_temp in production - https://phabricator.wikimedia.org/T307906
[06:34:54] (CR) Ayounsi: [C: +1] Add new per-rack cloudsw subnets for e4 and f4 to networks data [puppet] - https://gerrit.wikimedia.org/r/800730 (https://phabricator.wikimedia.org/T304989) (owner: Cathal Mooney)
[06:35:52] SRE, SectionTranslation, Language-Team (Language-2022-April-June): Deploy cxserver db password in private puppet repository - https://phabricator.wikimedia.org/T309486 (Marostegui) @MoritzMuehlenhoff is the clinic duty person for the 30th week. Removing DBA tag as this is not something DB specific.
[06:36:42] !log Drop renamed revision_actor_temp on s8 T307906
[06:36:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:48] T307906: Drop revision_actor_temp in production - https://phabricator.wikimedia.org/T307906
[06:37:45] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:39:08] !log Drop renamed revision_actor_temp on s4 T307906
[06:39:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:48:05] !log restart kube-api on ml-serve-ctrl1002 as attempt to clear some high api latencies / HTTP 504 due to LIST to a specific knative resource
[06:48:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:49:47] !log restart kube-api on ml-serve-ctrl2002 as attempt to clear some high api latencies / HTTP 504 due to LIST to a specific knative resource
[06:49:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:19] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[07:00:05] Amir1 and Urbanecm: That opportune time is upon us again. Time for a UTC morning backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220530T0700).
[07:00:05] koi: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:17] hi
[07:02:27] !log aqu@deploy1002 Started deploy [airflow-dags/analytics_test@3ae51e7]: (no justification provided)
[07:02:30] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics_test@3ae51e7]: (no justification provided) (duration: 00m 03s)
[07:02:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:03] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[07:09:47] (CR) Giuseppe Lavagetto: [C: +2] scap::master: remove the unused scap::l10nupdate class [puppet] - https://gerrit.wikimedia.org/r/799362 (owner: Giuseppe Lavagetto)
[07:12:01] <_joe_> jouncebot: next
[07:12:01] In 0 hour(s) and 47 minute(s): Custom deployment window for session handling fix (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220530T0800)
[07:13:51] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[07:14:36] (CR) Slyngshede: [C: +2] Remove cleanup on unused Fairscheduler for Hadoop. [puppet] - https://gerrit.wikimedia.org/r/799257 (owner: Slyngshede)
[07:15:39] (CR) Slyngshede: [C: +2] Remove cleanup on unused Fairscheduler for Hadoop. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/799257 (owner: Slyngshede)
[07:17:47] (CR) Muehlenhoff: "Can you please also add the header to the update-library.R file (R also supports single line comments with a leading #) and to the README." [puppet] - https://gerrit.wikimedia.org/r/800254 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:20:53] (CR) Muehlenhoff: [C: +2] "Looks good, thanks! Merging." [puppet] - https://gerrit.wikimedia.org/r/800251 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:22:24] (CR) Muehlenhoff: [C: +2] "Looks good, thanks! Merging." [puppet] - https://gerrit.wikimedia.org/r/800250 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:23:59] (CR) Muehlenhoff: [C: +2] "Looks good, thanks! Merging." [puppet] - https://gerrit.wikimedia.org/r/800249 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:26:41] (CR) Muehlenhoff: "Can you please also add the header to the juniper-mibs file?" [puppet] - https://gerrit.wikimedia.org/r/800248 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:31:22] (PS1) Muehlenhoff: squid: Add two additional headers [puppet] - https://gerrit.wikimedia.org/r/801331
[07:32:40] SRE, Icinga, observability: Icinga paged for a host that should have been downtimed - https://phabricator.wikimedia.org/T309447 (fgiunchedi) Here the log from icinga's perspective on `alert1001`: ` # grep db1112 /srv/icinga-logs/icinga-05-29-2022-00.log | perl -pe 's/\[(\d+)\]/localtime($1)/e' ... S...
[07:32:46] (CR) Muehlenhoff: [C: +2] squid: Add two additional headers [puppet] - https://gerrit.wikimedia.org/r/801331 (owner: Muehlenhoff)
[07:34:17] <_joe_> !log removing l10update leftovers from deployment servers in production
[07:34:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:07] (CR) Filippo Giunchedi: [C: +2] cfssl: write pretty json [puppet] - https://gerrit.wikimedia.org/r/800029 (owner: Filippo Giunchedi)
[07:36:14] (PS9) Filippo Giunchedi: cfssl: write pretty json [puppet] - https://gerrit.wikimedia.org/r/800029
[07:38:55] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:40:48] (CR) Filippo Giunchedi: [V: +1 C: +2] puppetdb: create dbs before grants [puppet] - https://gerrit.wikimedia.org/r/800031 (https://phabricator.wikimedia.org/T296550) (owner: Filippo Giunchedi)
[07:43:23] PROBLEM - very high load average likely xfs on ms-be1066 is CRITICAL: CRITICAL - load average: 109.31, 102.12, 70.19 https://wikitech.wikimedia.org/wiki/Swift
[07:46:37] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 113 probes of 680 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:49:46] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/800666 (https://phabricator.wikimedia.org/T307142) (owner: Jelto)
[07:51:49] SRE, Wikimedia-Mailing-lists, serviceops: Allow list admins to train spam filters - https://phabricator.wikimedia.org/T244241 (grin) Old mailman was able to //forward spam to an email address// (supposedly the admin), and I have been using it on my old lists to forward spam to my spam-learning email...
[07:52:16] (PS2) Muehlenhoff: purged: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/793401
[07:52:17] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:52:53] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 60 probes of 680 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:55:21] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:57:47] jouncebot: nowandnext
[07:57:47] For the next 0 hour(s) and 2 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220530T0700)
[07:57:47] In 0 hour(s) and 2 minute(s): Custom deployment window for session handling fix (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220530T0800)
[07:58:31] PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[08:00:04] tgr: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Custom deployment window for session handling fix . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220530T0800).
[08:01:56] (CR) Muehlenhoff: [C: +2] purged: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/793401 (owner: Muehlenhoff)
[08:02:00] (Abandoned) Giuseppe Lavagetto: httpd: reintroduce the default debian ports.conf where no changes were expected. [puppet] - https://gerrit.wikimedia.org/r/798633 (owner: Giuseppe Lavagetto)
[08:02:41] (CR) Giuseppe Lavagetto: [C: +2] scap: do not restart jobrunners on deployment [puppet] - https://gerrit.wikimedia.org/r/792980 (https://phabricator.wikimedia.org/T266055) (owner: Giuseppe Lavagetto)
[08:03:24] (CR) Giuseppe Lavagetto: mediawiki::php: check opcache revalidation in restart script (1 comment) [puppet] - https://gerrit.wikimedia.org/r/792982 (https://phabricator.wikimedia.org/T266055) (owner: Giuseppe Lavagetto)
[08:03:36] <_joe_> jouncebot: next
[08:03:37] In 4 hour(s) and 56 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220530T1300)
[08:03:55] (CR) Giuseppe Lavagetto: [C: +2] scap: enable restarting php-fpm on deployment [puppet] - https://gerrit.wikimedia.org/r/792981 (https://phabricator.wikimedia.org/T266055) (owner: Giuseppe Lavagetto)
[08:04:03] (PS2) Giuseppe Lavagetto: scap: enable restarting php-fpm on deployment [puppet] - https://gerrit.wikimedia.org/r/792981 (https://phabricator.wikimedia.org/T266055)
[08:08:31] !log installing dpkg security updates
[08:08:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[08:08:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[08:08:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:22] !log jbond@cumin2002 START - Cookbook sre.ganeti.makevm for new host netbox2002.codfw.wmnet
[08:09:23] !log jbond@cumin2002 START - Cookbook sre.dns.netbox
[08:09:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:10:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[08:10:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[08:10:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:10:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:26] SRE, Security: Cookbook to reboot cassandra nodes - https://phabricator.wikimedia.org/T288975 (Aklapper) a:razzi→None Resetting inactive task assignee
[08:13:52] !log jbond@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:13:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:56] !log jbond@cumin2002 START - Cookbook sre.dns.netbox
[08:13:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:15] !log jbond@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:18:16] !log jbond@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host netbox2002.codfw.wmnet
[08:18:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:21] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:21:55] PROBLEM - very high load average likely xfs on ms-be1066 is CRITICAL: CRITICAL - load average: 103.94, 100.25, 94.95 https://wikitech.wikimedia.org/wiki/Swift
[08:22:30] (PS2) Majavah: P:openstack::designate: set base_url to use the https port [puppet] - https://gerrit.wikimedia.org/r/800948 (https://phabricator.wikimedia.org/T267194)
[08:22:32] (PS3) Majavah: P:openstack::glance: remove primary_image_store concept [puppet] - https://gerrit.wikimedia.org/r/800949
[08:22:34] (PS3) Majavah: openstack::cinder: monitor the backend port [puppet] - https://gerrit.wikimedia.org/r/800950
[08:22:36] (PS3) Majavah: openstack::nova: monitor the backend port [puppet] - https://gerrit.wikimedia.org/r/800951
[08:22:38] (PS3) Majavah: P:openstack::haproxy: codfw1dev: remove non-tls ports [puppet] - https://gerrit.wikimedia.org/r/800952 (https://phabricator.wikimedia.org/T267194)
[08:22:40] (PS3) Majavah: P:openstack::haproxy: eqiad1: remove non-tls ports [puppet] - https://gerrit.wikimedia.org/r/800953 (https://phabricator.wikimedia.org/T267194)
[08:22:42] (PS3) Majavah: P:openstack::designate::firewall: cleanup [puppet] - https://gerrit.wikimedia.org/r/800954 (https://phabricator.wikimedia.org/T267194)
[08:22:44] (PS3) Majavah: P:openstack: misc cleanup for non-tls ports [puppet] - https://gerrit.wikimedia.org/r/800955 (https://phabricator.wikimedia.org/T267194)
[08:23:57] (CR) Awight: "For the record, this patch was associated with T308932." [mediawiki-config] - https://gerrit.wikimedia.org/r/793873 (owner: Ladsgroup)
[08:24:27] SRE, Icinga, observability: Icinga paged for a host that should have been downtimed - https://phabricator.wikimedia.org/T309447 (Ladsgroup) One random note. I think if a host is depooled (https://noc.wikimedia.org/dbconfig/eqiad.json), it shouldn't page under any condition. We have pages for user imp...
[08:28:11] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:29:42] !log jbond@cumin2002 START - Cookbook sre.ganeti.makevm for new host netbox2002.codfw.wmnet
[08:29:43] !log jbond@cumin2002 START - Cookbook sre.dns.netbox
[08:29:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:57] (CR) Ayounsi: "1 comment then lgtm!" [homer/public] - https://gerrit.wikimedia.org/r/793428 (https://phabricator.wikimedia.org/T304989) (owner: Cathal Mooney)
[08:32:47] !log jbond@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:32:51] !log jbond@cumin2002 START - Cookbook sre.dns.netbox
[08:32:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:04] (CR) Ladsgroup: [C: +2] Stop trying to pass legacy page_restrictions to RestrictionStore [extensions/LiquidThreads] (wmf/1.39.0-wmf.13) - https://gerrit.wikimedia.org/r/800705 (https://phabricator.wikimedia.org/T309460) (owner: Ladsgroup)
[08:34:09] (CR) Ayounsi: [C: +1] "Thanks!" [software/netbox-extras] - https://gerrit.wikimedia.org/r/793512 (https://phabricator.wikimedia.org/T308768) (owner: Volans)
[08:34:43] !log jbond@cumin1001 START - Cookbook sre.dns.wipe-cache netbox2002.codfw.wmnet on all recursors
[08:34:46] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netbox2002.codfw.wmnet on all recursors
[08:34:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:19] !log jbond@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[08:35:20] !log jbond@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host netbox2002.codfw.wmnet
[08:35:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:51] !log jbond@cumin2002 START - Cookbook sre.ganeti.makevm for new host netbox2002.codfw.wmnet
[08:35:52] !log jbond@cumin2002 START - Cookbook sre.dns.netbox
[08:35:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:58] !log jbond@cumin1001 START - Cookbook sre.dns.wipe-cache netbox2002.codfw.wmnet on all recursors
[08:36:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:02] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netbox2002.codfw.wmnet on all recursors
[08:36:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:06] (Merged) jenkins-bot: Stop trying to pass legacy page_restrictions to RestrictionStore [extensions/LiquidThreads] (wmf/1.39.0-wmf.13) - https://gerrit.wikimedia.org/r/800705 (https://phabricator.wikimedia.org/T309460) (owner: Ladsgroup)
[08:36:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:03] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[08:38:17] !log jbond@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:38:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:25] !log jbond@cumin2002 START - Cookbook sre.dns.netbox
[08:38:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:01] (CR) Volans: [C: +1] "LGTM, see comment inline" [software/spicerack] - https://gerrit.wikimedia.org/r/800676 (owner: Jbond)
[08:40:02] <_joe_> I'm about to perform a null deployment to test again that restarts with scap deployments work as expected
[08:40:53] !log ladsgroup@deploy1002 Synchronized php-1.39.0-wmf.13/extensions/LiquidThreads/classes/Thread.php: Backport: [[gerrit:800705|Stop trying to pass legacy page_restrictions to RestrictionStore (T309460)]] (duration: 00m 47s)
[08:40:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:59] T309460: PHP Notice: Undefined property: stdClass::$page_restrictions - https://phabricator.wikimedia.org/T309460
[08:41:00] I'm done with the deploy
[08:42:14] (PS1) David Caro: nova: add user to libvirt-qemu [puppet] - https://gerrit.wikimedia.org/r/801336 (https://phabricator.wikimedia.org/T309342)
[08:42:19] !log jbond@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:42:20] !log jbond@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host netbox2002.codfw.wmnet
[08:42:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:41] !log oblivian@deploy1002 Synchronized README: testing php restarts with scap, T266055 (duration: 00m 45s)
[08:42:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:49] T266055: Update Scap to perform rolling restart for all MW deploy - https://phabricator.wikimedia.org/T266055
[08:43:52] Puppet, SRE, Infrastructure-Foundations, Patch-For-Review: Puppet should prune stale entries from sudoers.d - https://phabricator.wikimedia.org/T309268 (MoritzMuehlenhoff) p:Triage→Medium
[08:43:56] (CR) David Caro: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35614/console" [puppet] - https://gerrit.wikimedia.org/r/801336 (https://phabricator.wikimedia.org/T309342) (owner: David Caro)
[08:44:52] SRE, SRE-Access-Requests, Patch-For-Review: [WIP] Requesting access to deployment group for TThoabala - https://phabricator.wikimedia.org/T303398 (TThoabala)
[08:45:55] !log disable puppet fleet wide Gerrit:799344
[08:46:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:46:23] (PS1) Giuseppe Lavagetto: Revert "scap::master: remove the unused scap::l10nupdate class" [puppet] - https://gerrit.wikimedia.org/r/801188
[08:46:30] (CR) Giuseppe Lavagetto: [V: +2 C: +2] Revert "scap::master: remove the unused scap::l10nupdate class" [puppet] - https://gerrit.wikimedia.org/r/801188 (owner: Giuseppe Lavagetto)
[08:48:17] (CR) Jbond: [C: +2] nrpe: manage sudo rules via nrpe::check (try 2) [puppet] - https://gerrit.wikimedia.org/r/799344 (owner: Majavah)
[08:50:01] (CR) Volans: [C: +2] Icinga: add page hashtag to paging host alerts [puppet] - https://gerrit.wikimedia.org/r/799903 (owner: Volans)
[08:51:11] RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.093 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[08:51:53] PROBLEM - very high load average likely xfs on ms-be1066 is CRITICAL: CRITICAL - load average: 105.85, 101.30, 97.14 https://wikitech.wikimedia.org/wiki/Swift
[08:52:05] (PS4) Slyngshede: P:aptrepo::private refactor code to allow for multiple repositories. [puppet] - https://gerrit.wikimedia.org/r/799340
[08:52:45] !log jbond@cumin1001 START - Cookbook sre.dns.wipe-cache netbox2002.codfw.wmnet on all recursors
[08:52:45] (CR) CI reject: [V: -1] P:aptrepo::private refactor code to allow for multiple repositories. [puppet] - https://gerrit.wikimedia.org/r/799340 (owner: Slyngshede)
[08:52:48] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netbox2002.codfw.wmnet on all recursors
[08:52:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:41] !log jbond@cumin2002 START - Cookbook sre.ganeti.makevm for new host netbox2002.codfw.wmnet
[08:53:43] !log jbond@cumin2002 START - Cookbook sre.dns.netbox
[08:53:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:58] !log jbond@cumin1001 START - Cookbook sre.dns.wipe-cache netbox2002.codfw.wmnet on all recursors
[08:55:01] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netbox2002.codfw.wmnet on all recursors
[08:55:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:05] (PS5) Slyngshede: P:aptrepo::private refactor code to allow for multiple repositories. [puppet] - https://gerrit.wikimedia.org/r/799340
[08:55:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[08:55:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:01] (CR) CI reject: [V: -1] P:aptrepo::private refactor code to allow for multiple repositories.
[puppet] - 10https://gerrit.wikimedia.org/r/799340 (owner: 10Slyngshede) [08:58:12] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [08:58:15] (03PS1) 10Muehlenhoff: Remove LDAP access for sthart [puppet] - 10https://gerrit.wikimedia.org/r/801340 [08:59:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:59:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:59:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:33] (03CR) 10Muehlenhoff: [C: 03+2] Remove LDAP access for sthart [puppet] - 10https://gerrit.wikimedia.org/r/801340 (owner: 10Muehlenhoff) [09:02:26] !log jbond@cumin1001 START - Cookbook sre.dns.wipe-cache netbox2002.codfw.wmnet on all recursors [09:02:30] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netbox2002.codfw.wmnet on all recursors [09:02:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [09:02:36] (03CR) 10Jelto: [C: 03+2] idp: add gitlab-new to idp [puppet] - 10https://gerrit.wikimedia.org/r/800666 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [09:02:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [09:02:39] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:55] !log jbond@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:03:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:03] !log jbond@cumin2002 START - Cookbook sre.dns.netbox [09:03:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:28] (03PS1) 10Volans: Revert "Icinga: add page hashtag to paging host alerts" [puppet] - 10https://gerrit.wikimedia.org/r/801189 [09:03:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [09:03:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:50] !log jbond@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:06:51] !log jbond@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host netbox2002.codfw.wmnet [09:06:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:40] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [09:08:45] (03CR) 10Volans: [C: 03+2] Revert "Icinga: add page hashtag to paging host alerts" [puppet] - 10https://gerrit.wikimedia.org/r/801189 (owner: 10Volans) [09:11:27] (03PS2) 10Jelto: gitlab: use gitlab1004 as replia/passive host [puppet] - 10https://gerrit.wikimedia.org/r/800728 (https://phabricator.wikimedia.org/T307142) [09:12:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [09:12:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [09:12:31] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:00] !log re-enable puppet [09:13:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:24] (03PS2) 10Muehlenhoff: idp::memcached: Only enable memcached_16 on Buster [puppet] - 10https://gerrit.wikimedia.org/r/799354 (https://phabricator.wikimedia.org/T308214) [09:15:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [09:15:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [09:15:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1111.eqiad.wmnet with reason: Maintenance [09:19:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1111.eqiad.wmnet with reason: Maintenance [09:19:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:11] (03CR) 10Btullis: [C: 03+2] wikireplicas: Improve log message for skipped views [puppet] - 10https://gerrit.wikimedia.org/r/786382 (owner: 10BryanDavis) [09:23:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1116.eqiad.wmnet with reason: Maintenance [09:23:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1116.eqiad.wmnet with reason: Maintenance [09:23:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:45] !log ladsgroup@cumin1001 
START - Cookbook sre.hosts.downtime for 6:00:00 on db1172.eqiad.wmnet with reason: Maintenance [09:27:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1172.eqiad.wmnet with reason: Maintenance [09:27:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1172 (T309311)', diff saved to https://phabricator.wikimedia.org/P28915 and previous config saved to /var/cache/conftool/dbconfig/20220530-092751-ladsgroup.json [09:27:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:59] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [09:31:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1143.eqiad.wmnet with reason: Maintenance [09:31:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1143.eqiad.wmnet with reason: Maintenance [09:31:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1143 (T60674)', diff saved to https://phabricator.wikimedia.org/P28916 and previous config saved to /var/cache/conftool/dbconfig/20220530-093121-ladsgroup.json [09:31:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:30] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [09:31:48] ACKNOWLEDGEMENT - SSH on db2088 is CRITICAL: CRITICAL - Socket timeout after 10 seconds Marostegui https://phabricator.wikimedia.org/T309485 https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:31:48] ACKNOWLEDGEMENT - Host db2088 is DOWN: PING CRITICAL 
- Packet loss = 100% Marostegui https://phabricator.wikimedia.org/T309485 [09:32:51] !log jbond@cumin2002 START - Cookbook sre.ganeti.makevm for new host netbox2002.codfw.wmnet [09:32:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:13] !log jbond@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host netbox2002.codfw.wmnet [09:33:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:33] !log jbond@cumin2002 START - Cookbook sre.ganeti.makevm for new host netbox2002.codfw.wmnet [09:33:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:08] !log jbond@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=93) for new host netbox2002.codfw.wmnet [09:34:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:47] !log jbond@cumin2002 START - Cookbook sre.ganeti.makevm for new host netbox2002.codfw.wmnet [09:35:48] !log jbond@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host netbox2002.codfw.wmnet [09:35:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:16] RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:37:40] !log jbond@cumin2002 START - Cookbook sre.ganeti.makevm for new host netbox2002.codfw.wmnet [09:37:42] !log jbond@cumin2002 START - Cookbook sre.dns.netbox [09:37:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:50] !log jbond@cumin1001 START - Cookbook sre.dns.wipe-cache netbox2002.codfw.wmnet on all recursors [09:37:53] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netbox2002.codfw.wmnet on all 
recursors [09:37:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:18] !log jbond@cumin1001 START - Cookbook sre.dns.wipe-cache netbox2002.codfw.wmnet on all recursors [09:38:21] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netbox2002.codfw.wmnet on all recursors [09:38:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:46] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:39:17] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10Gehel) [09:40:11] !log jbond@cumin1001 START - Cookbook sre.dns.wipe-cache netbox2002.codfw.wmnet on all recursors [09:40:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:14] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netbox2002.codfw.wmnet on all recursors [09:40:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:19] !log jbond@cumin1001 START - Cookbook sre.dns.wipe-cache netbox2002.codfw.wmnet on all recursors [09:40:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:23] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netbox2002.codfw.wmnet on all recursors [09:40:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:44] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, 
WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [09:40:48] !log jbond@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:40:49] !log jbond@cumin1001 START - Cookbook sre.dns.wipe-cache netbox2002.codfw.wmnet on all recursors [09:40:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:52] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netbox2002.codfw.wmnet on all recursors [09:40:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [09:41:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [09:41:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:52] 10SRE-tools, 10Spicerack: sre.ganeti.makevm NXDOMAIN race condition - https://phabricator.wikimedia.org/T309505 (10jbond) p:05Triage→03Medium [09:46:36] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: [WIP] Requesting access to deployment group for TThoabala - https://phabricator.wikimedia.org/T303398 (10Tchanders) 05Invalid→03Open [09:48:40] !log jbond@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netbox2002.codfw.wmnet [09:48:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:06] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10MatthewVernon) [09:53:34] !log oblivian@deploy1002 Synchronized README: testing php restarts with scap, T266055 
(duration: 03m 30s) [09:53:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:40] T266055: Update Scap to perform rolling restart for all MW deploy - https://phabricator.wikimedia.org/T266055 [09:55:11] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/800731 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney) [09:56:10] RECOVERY - very high load average likely xfs on ms-be1066 is OK: OK - load average: 58.08, 63.11, 72.95 https://wikitech.wikimedia.org/wiki/Swift [10:03:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T60674)', diff saved to https://phabricator.wikimedia.org/P28917 and previous config saved to /var/cache/conftool/dbconfig/20220530-100312-ladsgroup.json [10:03:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:19] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [10:03:20] (03PS1) 10Giuseppe Lavagetto: scap: use capitalized true [puppet] - 10https://gerrit.wikimedia.org/r/801345 [10:06:27] (03PS1) 10Jbond: install: netbox2002 add mac for netbox2002 [puppet] - 10https://gerrit.wikimedia.org/r/801347 [10:08:00] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/799348 (https://phabricator.wikimedia.org/T308214) (owner: 10Muehlenhoff) [10:12:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance [10:12:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance [10:12:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1157 (T60674)', diff saved to https://phabricator.wikimedia.org/P28918 and previous config saved to 
/var/cache/conftool/dbconfig/20220530-101236-ladsgroup.json [10:12:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:44] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [10:13:11] (03PS1) 10Cathal Mooney: Re-add VRRP term to labs-in filter [homer/public] - 10https://gerrit.wikimedia.org/r/801348 (https://phabricator.wikimedia.org/T304989) [10:13:25] (03PS2) 10Vlad.shapik: WP:Upgrade thumbor to Thumbor 7 and python3 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719) [10:14:25] (03CR) 10Giuseppe Lavagetto: [C: 03+2] scap: use capitalized true [puppet] - 10https://gerrit.wikimedia.org/r/801345 (owner: 10Giuseppe Lavagetto) [10:14:27] (03CR) 10Jbond: [C: 03+2] install: netbox2002 add mac for netbox2002 [puppet] - 10https://gerrit.wikimedia.org/r/801347 (owner: 10Jbond) [10:14:52] _joe_: happy for me to merge your cr [10:15:15] <_joe_> jbond: yes I somehow ran the alias for running puppet instead than for merging changes [10:15:27] :D, merging [10:15:29] <_joe_> please do :P [10:15:33] (03CR) 10Cathal Mooney: [C: 03+2] Re-add VRRP term to labs-in filter [homer/public] - 10https://gerrit.wikimedia.org/r/801348 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney) [10:15:46] _joe_: done [10:15:51] <_joe_> thanks [10:16:52] (03Merged) 10jenkins-bot: Re-add VRRP term to labs-in filter [homer/public] - 10https://gerrit.wikimedia.org/r/801348 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney) [10:18:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P28920 and previous config saved to /var/cache/conftool/dbconfig/20220530-101817-ladsgroup.json [10:18:22] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T60674)', diff saved to https://phabricator.wikimedia.org/P28921 and previous config saved to /var/cache/conftool/dbconfig/20220530-101917-ladsgroup.json [10:19:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:23] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [10:22:51] (03CR) 10Muehlenhoff: [C: 03+2] memcached: Untangle TLS/1.6 options [puppet] - 10https://gerrit.wikimedia.org/r/799348 (https://phabricator.wikimedia.org/T308214) (owner: 10Muehlenhoff) [10:28:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T309311)', diff saved to https://phabricator.wikimedia.org/P28922 and previous config saved to /var/cache/conftool/dbconfig/20220530-102805-ladsgroup.json [10:28:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:13] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [10:29:16] (03CR) 10Hnowlan: [C: 03+2] changeprop: Switch Beta Cluster RESTbase target server to restbase04 [deployment-charts] - 10https://gerrit.wikimedia.org/r/790425 (https://phabricator.wikimedia.org/T306052) (owner: 10Jforrester) [10:29:57] (03CR) 10Hnowlan: [C: 03+2] deployment-prep: Drop deployment-restbase03, no longer to be used [puppet] - 10https://gerrit.wikimedia.org/r/790424 (https://phabricator.wikimedia.org/T306052) (owner: 10Jforrester) [10:33:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P28923 and previous config saved to /var/cache/conftool/dbconfig/20220530-103322-ladsgroup.json [10:33:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:11] 10SRE, 10ops-eqiad, 10DC-Ops, 
10Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10ayounsi) Please don't forget to run Homer after re-naming as the switch port description contains the hostname. The current outs... [10:34:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P28924 and previous config saved to /var/cache/conftool/dbconfig/20220530-103422-ladsgroup.json [10:34:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:44] (03Merged) 10jenkins-bot: changeprop: Switch Beta Cluster RESTbase target server to restbase04 [deployment-charts] - 10https://gerrit.wikimedia.org/r/790425 (https://phabricator.wikimedia.org/T306052) (owner: 10Jforrester) [10:39:48] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:40:21] (03CR) 10Hnowlan: [C: 03+2] "Deployed using https://wikitech.wikimedia.org/wiki/Changeprop#To_deployment-prep" [deployment-charts] - 10https://gerrit.wikimedia.org/r/790425 (https://phabricator.wikimedia.org/T306052) (owner: 10Jforrester) [10:42:00] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [10:42:15] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10Lucas_Werkmeister_WMDE) [10:43:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P28925 and previous config saved to /var/cache/conftool/dbconfig/20220530-104310-ladsgroup.json [10:43:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:45] !log jbond@cumin2002 START - Cookbook sre.hosts.decommission 
for hosts netbox2002.codfw.wmnet [10:44:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T60674)', diff saved to https://phabricator.wikimedia.org/P28926 and previous config saved to /var/cache/conftool/dbconfig/20220530-104827-ladsgroup.json [10:48:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [10:48:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [10:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:33] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [10:48:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:44] !log jbond@cumin2002 START - Cookbook sre.dns.netbox [10:48:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P28927 and previous config saved to /var/cache/conftool/dbconfig/20220530-104927-ladsgroup.json [10:49:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:08] 10SRE, 10Icinga, 10observability: Icinga paged for a host that should have been downtimed - https://phabricator.wikimedia.org/T309447 (10Volans) >>! In T309447#7966236, @fgiunchedi wrote: > Off the top of my head I can't think of any obvious reason why Icinga decided to send out the notification, for now I'm... 
[10:51:30] (03PS1) 10Jbond: CONTRIBUTORS: add Lucas Werkmeister [puppet] - 10https://gerrit.wikimedia.org/r/801353 (https://phabricator.wikimedia.org/T308013) [10:51:41] 10SRE, 10SRE-tools, 10Icinga, 10Infrastructure-Foundations, 10observability: Icinga paged for a host that should have been downtimed - https://phabricator.wikimedia.org/T309447 (10Volans) [10:52:11] (03PS1) 10Muehlenhoff: Add Lucas Werkmeister to CONTRIBUTORS [puppet] - 10https://gerrit.wikimedia.org/r/801354 [10:52:28] lol [10:52:57] jbond, moritzm: that looks like a double change to me :P [10:53:00] all great minds think alike :-) [10:53:02] !log jbond@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:53:03] !log jbond@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netbox2002.codfw.wmnet [10:53:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:34] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: sre.ganeti.makevm NXDOMAIN race condition - https://phabricator.wikimedia.org/T309505 (10Volans) Sure, let's call that cookbook from the makevm at the right time. 
[10:53:41] !log jbond@cumin2002 START - Cookbook sre.ganeti.makevm for new host netbox2002.codfw.wmnet [10:53:43] !log jbond@cumin2002 START - Cookbook sre.dns.netbox [10:53:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:51] !log jbond@cumin1001 START - Cookbook sre.dns.wipe-cache netbox2002.codfw.wmnet on all recursors [10:53:55] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netbox2002.codfw.wmnet on all recursors [10:53:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:58] (03CR) 10Jbond: [C: 03+2] CONTRIBUTORS: add Lucas Werkmeister [puppet] - 10https://gerrit.wikimedia.org/r/801353 (https://phabricator.wikimedia.org/T308013) (owner: 10Jbond) [10:56:16] !log jbond@cumin1001 START - Cookbook sre.dns.wipe-cache netbox2002.codfw.wmnet on all recursors [10:56:19] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netbox2002.codfw.wmnet on all recursors [10:56:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:26] !log jbond@cumin1001 START - Cookbook sre.dns.wipe-cache netbox2002.codfw.wmnet on all recursors [10:56:29] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netbox2002.codfw.wmnet on all recursors [10:56:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:53] !log jbond@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:56:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:57] !log jbond@cumin2002 START - Cookbook sre.dns.netbox [10:57:00] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:39] !log jbond@cumin1001 START - Cookbook sre.dns.wipe-cache netbox2002.codfw.wmnet on all recursors [10:57:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:42] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netbox2002.codfw.wmnet on all recursors [10:57:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P28928 and previous config saved to /var/cache/conftool/dbconfig/20220530-105815-ladsgroup.json [10:58:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:02] (03Abandoned) 10Muehlenhoff: Add Lucas Werkmeister to CONTRIBUTORS [puppet] - 10https://gerrit.wikimedia.org/r/801354 (owner: 10Muehlenhoff) [11:01:38] !log jbond@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:01:39] !log jbond@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host netbox2002.codfw.wmnet [11:01:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:11] (03PS1) 10Jbond: sre.ganeti.makevm: Clear the DNS cache before adding the ganeti instance [cookbooks] - 10https://gerrit.wikimedia.org/r/801355 [11:03:31] (03PS2) 10Jbond: sre.ganeti.makevm: Clear the DNS cache before adding the ganeti instance [cookbooks] - 10https://gerrit.wikimedia.org/r/801355 [11:03:52] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/801355 (owner: 10Jbond) [11:04:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T60674)', diff saved to https://phabricator.wikimedia.org/P28929 and previous config saved to 
/var/cache/conftool/dbconfig/20220530-110432-ladsgroup.json [11:04:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [11:04:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [11:04:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:38] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [11:04:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:38] !log jbond@cumin2002 START - Cookbook sre.ganeti.makevm for new host netbox2002.codfw.wmnet [11:05:39] !log jbond@cumin2002 START - Cookbook sre.dns.netbox [11:05:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:53] (03CR) 10Jbond: [C: 03+2] sre.ganeti.makevm: Clear the DNS cache before adding the ganeti instance [cookbooks] - 10https://gerrit.wikimedia.org/r/801355 (owner: 10Jbond) [11:06:25] (03PS3) 10Jbond: sre.ganeti.makevm: Clear the DNS cache before adding the ganeti instance [cookbooks] - 10https://gerrit.wikimedia.org/r/801355 (https://phabricator.wikimedia.org/T309505) [11:07:10] (03CR) 10Jelto: [C: 03+2] gitlab: use gitlab1004 as replia/passive host [puppet] - 10https://gerrit.wikimedia.org/r/800728 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [11:07:18] (03PS1) 10Marostegui: Revert "db1184: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/801190 [11:08:30] (03CR) 10Marostegui: [C: 03+2] Revert "db1184: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/801190 (owner: 10Marostegui) [11:09:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1184 (re)pooling @ 
5%: After cloning db1128', diff saved to https://phabricator.wikimedia.org/P28930 and previous config saved to /var/cache/conftool/dbconfig/20220530-110935-root.json [11:09:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:43] !log jbond@cumin1001 START - Cookbook sre.dns.wipe-cache netbox2002.codfw.wmnet on all recursors [11:09:46] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netbox2002.codfw.wmnet on all recursors [11:09:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:57] !log jbond@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:09:58] !log jbond@cumin1001 START - Cookbook sre.dns.wipe-cache netbox2002.codfw.wmnet on all recursors [11:09:59] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10Addshore) [11:10:01] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netbox2002.codfw.wmnet on all recursors [11:10:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:23] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10Addshore) [11:13:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T309311)', diff saved to https://phabricator.wikimedia.org/P28931 and previous config saved to /var/cache/conftool/dbconfig/20220530-111320-ladsgroup.json [11:13:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1177.eqiad.wmnet with reason: Maintenance [11:13:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 6:00:00 on db1177.eqiad.wmnet with reason: Maintenance [11:13:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:27] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [11:13:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1177 (T309311)', diff saved to https://phabricator.wikimedia.org/P28932 and previous config saved to /var/cache/conftool/dbconfig/20220530-111328-ladsgroup.json [11:13:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:56] (03PS1) 10Marostegui: mariadb: Promote db1130 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/801361 (https://phabricator.wikimedia.org/T308725) [11:16:59] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10Tobi_WMDE_SW) [11:17:20] (03PS1) 10Marostegui: wmnet: Update s5-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/801362 (https://phabricator.wikimedia.org/T308725) [11:17:35] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover date" [puppet] - 10https://gerrit.wikimedia.org/r/801361 (https://phabricator.wikimedia.org/T308725) (owner: 10Marostegui) [11:17:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [11:17:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [11:17:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3314 (T60674)', diff saved to https://phabricator.wikimedia.org/P28933 and previous config saved to 
/var/cache/conftool/dbconfig/20220530-111743-ladsgroup.json [11:17:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:51] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [11:17:51] !log jbond@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netbox2002.codfw.wmnet [11:17:52] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover date" [dns] - 10https://gerrit.wikimedia.org/r/801362 (https://phabricator.wikimedia.org/T308725) (owner: 10Marostegui) [11:17:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T309311)', diff saved to https://phabricator.wikimedia.org/P28934 and previous config saved to /var/cache/conftool/dbconfig/20220530-111912-ladsgroup.json [11:19:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:18] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [11:21:53] (03CR) 10Giuseppe Lavagetto: [C: 04-2] "This won't work as-is. 
Right now the way we use to enforce a restart is checking for an absurd amount of free opcache, so this change woul" [puppet] - 10https://gerrit.wikimedia.org/r/792982 (https://phabricator.wikimedia.org/T266055) (owner: 10Giuseppe Lavagetto) [11:22:45] (03PS1) 10Jbond: installl: fix mac for netbox2002 [puppet] - 10https://gerrit.wikimedia.org/r/801370 [11:23:14] (03CR) 10Jbond: [C: 03+2] installl: fix mac for netbox2002 [puppet] - 10https://gerrit.wikimedia.org/r/801370 (owner: 10Jbond) [11:24:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1184 (re)pooling @ 10%: After cloning db1128', diff saved to https://phabricator.wikimedia.org/P28935 and previous config saved to /var/cache/conftool/dbconfig/20220530-112439-root.json [11:24:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:56] !log joal@deploy1002 Started deploy [airflow-dags/analytics@f3bd88c]: (no justification provided) [11:30:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:09] !log joal@deploy1002 Finished deploy [airflow-dags/analytics@f3bd88c]: (no justification provided) (duration: 00m 12s) [11:30:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance [11:34:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P28936 and previous config saved to /var/cache/conftool/dbconfig/20220530-113417-ladsgroup.json [11:34:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance [11:34:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [11:34:22] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [11:34:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T60674)', diff saved to https://phabricator.wikimedia.org/P28937 and previous config saved to /var/cache/conftool/dbconfig/20220530-113428-ladsgroup.json [11:34:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:45] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [11:39:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1184 (re)pooling @ 25%: After cloning db1128', diff saved to https://phabricator.wikimedia.org/P28938 and previous config saved to /var/cache/conftool/dbconfig/20220530-113943-root.json [11:39:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T60674)', diff saved to https://phabricator.wikimedia.org/P28939 and previous config saved to /var/cache/conftool/dbconfig/20220530-114109-ladsgroup.json [11:41:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:15] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [11:41:18] (03CR) 10Muehlenhoff: [C: 03+2] idp::memcached: Only enable memcached_16 on Buster [puppet] - 10https://gerrit.wikimedia.org/r/799354 (https://phabricator.wikimedia.org/T308214) 
(owner: 10Muehlenhoff) [11:45:51] (03PS1) 10Volans: doc: set default language [software/spicerack] - 10https://gerrit.wikimedia.org/r/801375 [11:49:02] (03PS1) 10Muehlenhoff: Add Addshore and Tobias Gritschacher to contributors [puppet] - 10https://gerrit.wikimedia.org/r/801377 [11:49:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P28940 and previous config saved to /var/cache/conftool/dbconfig/20220530-114922-ladsgroup.json [11:49:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:28] (03CR) 10Addshore: [C: 03+1] Add Addshore and Tobias Gritschacher to contributors [puppet] - 10https://gerrit.wikimedia.org/r/801377 (owner: 10Muehlenhoff) [11:49:48] (03CR) 10Muehlenhoff: [C: 03+2] Add Addshore and Tobias Gritschacher to contributors [puppet] - 10https://gerrit.wikimedia.org/r/801377 (owner: 10Muehlenhoff) [11:51:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T60674)', diff saved to https://phabricator.wikimedia.org/P28941 and previous config saved to /var/cache/conftool/dbconfig/20220530-115153-ladsgroup.json [11:51:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:00] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [11:53:52] 10SRE, 10Performance-Team, 10Traffic, 10serviceops: Decide on details of progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10tstarling) If we send a percentage of traffic to the local DC, is it necessary (for sessions etc.) to consistently send a given user to the same DC? 
[11:54:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1184 (re)pooling @ 50%: After cloning db1128', diff saved to https://phabricator.wikimedia.org/P28942 and previous config saved to /var/cache/conftool/dbconfig/20220530-115446-root.json [11:54:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P28943 and previous config saved to /var/cache/conftool/dbconfig/20220530-115615-ladsgroup.json [11:56:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:52] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [12:01:52] (03CR) 10Ladsgroup: [C: 03+1] wmnet: Update s5-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/801362 (https://phabricator.wikimedia.org/T308725) (owner: 10Marostegui) [12:03:18] (03CR) 10Ladsgroup: [C: 03+1] mariadb: Promote db1130 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/801361 (https://phabricator.wikimedia.org/T308725) (owner: 10Marostegui) [12:03:26] (03PS1) 10Marostegui: Revert "db1111: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/801192 [12:04:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T309311)', diff saved to https://phabricator.wikimedia.org/P28944 and previous config saved to /var/cache/conftool/dbconfig/20220530-120427-ladsgroup.json [12:04:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1167.eqiad.wmnet with reason: Maintenance [12:04:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 6:00:00 on db1167.eqiad.wmnet with reason: Maintenance [12:04:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [12:04:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:35] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [12:04:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [12:04:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1167 (T309311)', diff saved to https://phabricator.wikimedia.org/P28945 and previous config saved to /var/cache/conftool/dbconfig/20220530-120440-ladsgroup.json [12:04:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 1%: After migrating it to 10.6', diff saved to https://phabricator.wikimedia.org/P28946 and previous config saved to /var/cache/conftool/dbconfig/20220530-120530-root.json [12:05:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:56] (03CR) 10Marostegui: [C: 03+2] Revert "db1111: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/801192 (owner: 10Marostegui) [12:06:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P28947 and previous config saved to 
/var/cache/conftool/dbconfig/20220530-120658-ladsgroup.json [12:07:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:47] !log installing java 8/11 security updates [12:09:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1184 (re)pooling @ 75%: After cloning db1128', diff saved to https://phabricator.wikimedia.org/P28948 and previous config saved to /var/cache/conftool/dbconfig/20220530-120950-root.json [12:09:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P28949 and previous config saved to /var/cache/conftool/dbconfig/20220530-121120-ladsgroup.json [12:11:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:39] (03PS1) 10Majavah: P:ceph: cleanup firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/801380 [12:14:16] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35616/console" [puppet] - 10https://gerrit.wikimedia.org/r/801380 (owner: 10Majavah) [12:15:39] (03CR) 10Majavah: P:ceph: cleanup firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/801380 (owner: 10Majavah) [12:20:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 5%: After migrating it to 10.6', diff saved to https://phabricator.wikimedia.org/P28950 and previous config saved to /var/cache/conftool/dbconfig/20220530-122034-root.json [12:20:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P28951 and previous config saved to /var/cache/conftool/dbconfig/20220530-122203-ladsgroup.json 
[12:22:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:05] PROBLEM - Check systemd state on ms-be1066 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:24:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1184 (re)pooling @ 100%: After cloning db1128', diff saved to https://phabricator.wikimedia.org/P28952 and previous config saved to /var/cache/conftool/dbconfig/20220530-122454-root.json [12:24:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:22] (03PS1) 10Kevin Bazira: ml-services: add frwikisource & frwiki articlequality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/801381 (https://phabricator.wikimedia.org/T307418) [12:25:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T309311)', diff saved to https://phabricator.wikimedia.org/P28953 and previous config saved to /var/cache/conftool/dbconfig/20220530-122534-ladsgroup.json [12:25:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:41] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [12:26:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T60674)', diff saved to https://phabricator.wikimedia.org/P28954 and previous config saved to /var/cache/conftool/dbconfig/20220530-122625-ladsgroup.json [12:26:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [12:26:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:30] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [12:26:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2105.codfw.wmnet with reason: 
Maintenance [12:26:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 6 hosts with reason: Maintenance [12:26:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 6 hosts with reason: Maintenance [12:26:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:38] (03PS1) 10Jbond: install: fix dhcpd config for netbox2002 [puppet] - 10https://gerrit.wikimedia.org/r/801382 [12:30:25] (03CR) 10Jbond: [V: 03+2 C: 03+2] install: fix dhcpd config for netbox2002 [puppet] - 10https://gerrit.wikimedia.org/r/801382 (owner: 10Jbond) [12:33:04] (03CR) 10Alexandros Kosiaris: mathoid: pipeline bot promote (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/792625 (owner: 10PipelineBot) [12:33:11] (03CR) 10Alexandros Kosiaris: [C: 03+2] mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/792625 (owner: 10PipelineBot) [12:34:13] !log Drop renamed revision_actor_temp on s3 T307906 [12:34:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:22] T307906: Drop revision_actor_temp in production - https://phabricator.wikimedia.org/T307906 [12:35:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 10%: After migrating it to 10.6', diff saved to https://phabricator.wikimedia.org/P28955 and previous config saved to /var/cache/conftool/dbconfig/20220530-123538-root.json [12:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:17] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [12:36:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [12:36:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [12:36:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T60674)', diff saved to https://phabricator.wikimedia.org/P28956 and previous config saved to /var/cache/conftool/dbconfig/20220530-123644-ladsgroup.json [12:36:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:53] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [12:37:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T60674)', diff saved to https://phabricator.wikimedia.org/P28957 and previous config saved to /var/cache/conftool/dbconfig/20220530-123708-ladsgroup.json [12:37:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1138.eqiad.wmnet with reason: Maintenance [12:37:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1138.eqiad.wmnet with reason: Maintenance [12:37:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1138 (T60674)', diff saved to https://phabricator.wikimedia.org/P28958 and previous config saved to 
/var/cache/conftool/dbconfig/20220530-123716-ladsgroup.json [12:37:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1128.eqiad.wmnet with reason: Maintenance [12:38:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1128.eqiad.wmnet with reason: Maintenance [12:38:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:03] (03Merged) 10jenkins-bot: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/792625 (owner: 10PipelineBot) [12:39:27] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:40:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P28959 and previous config saved to /var/cache/conftool/dbconfig/20220530-124039-ladsgroup.json [12:40:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:19] (03CR) 10Ssingh: [C: 03+2] aptrepo: add a component for dnsdist/pdns-recursor for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/793495 (https://phabricator.wikimedia.org/T305589) (owner: 10Ssingh) [12:46:05] (03PS2) 10Ssingh: aptrepo: add a component for dnsdist/pdns-recursor for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/793495 (https://phabricator.wikimedia.org/T305589) [12:47:11] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [12:47:33] (03PS1) 10Majavah: sonofgridengine: grid_configurator: make the grid master a submit host [puppet] - 10https://gerrit.wikimedia.org/r/801385 (https://phabricator.wikimedia.org/T277653) [12:47:41] (03CR) 10Kosta Harlan: [C: 03+1] Log output of scheduled MediaWiki maintenance scripts [puppet] - 10https://gerrit.wikimedia.org/r/800683 (https://phabricator.wikimedia.org/T285896) (owner: 10Gergő Tisza) [12:48:07] !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/mathoid: apply [12:48:10] (03CR) 10Ssingh: "(rebased, no code change)" [puppet] - 10https://gerrit.wikimedia.org/r/793495 (https://phabricator.wikimedia.org/T305589) (owner: 10Ssingh) [12:48:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:15] (03CR) 10Ssingh: [C: 03+2] aptrepo: add a component for dnsdist/pdns-recursor for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/793495 (https://phabricator.wikimedia.org/T305589) (owner: 10Ssingh) [12:48:31] !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/mathoid: apply [12:48:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 20%: After migrating it to 10.6', diff saved to https://phabricator.wikimedia.org/P28960 and previous config saved to /var/cache/conftool/dbconfig/20220530-125042-root.json [12:50:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance [12:52:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 
on db1184.eqiad.wmnet with reason: Maintenance [12:52:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T60674)', diff saved to https://phabricator.wikimedia.org/P28961 and previous config saved to /var/cache/conftool/dbconfig/20220530-125233-ladsgroup.json [12:52:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:42] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [12:54:21] (03PS2) 10Jbond: redfish: Assume all GET and HEAD requests are RO [software/spicerack] - 10https://gerrit.wikimedia.org/r/800676 [12:54:33] (03CR) 10Jbond: redfish: Assume all GET and HEAD requests are RO (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/800676 (owner: 10Jbond) [12:55:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T60674)', diff saved to https://phabricator.wikimedia.org/P28962 and previous config saved to /var/cache/conftool/dbconfig/20220530-125539-ladsgroup.json [12:55:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P28963 and previous config saved to /var/cache/conftool/dbconfig/20220530-125544-ladsgroup.json [12:55:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:28] (03CR) 10Jbond: [C: 03+1] doc: set default language [software/spicerack] - 10https://gerrit.wikimedia.org/r/801375 (owner: 10Volans) [12:57:18] (03CR) 10Volans: [C: 03+2] doc: set default language [software/spicerack] - 10https://gerrit.wikimedia.org/r/801375 (owner: 10Volans) [12:57:48] !log akosiaris@deploy1002 helmfile [codfw] START 
helmfile.d/services/mathoid: apply [12:57:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:37] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/mathoid: apply [12:58:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: (Dis)respected human, time to deploy UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220530T1300). Please do the needful. [13:00:04] koi: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:19] hi [13:00:21] hey koi [13:00:25] i can deploy today [13:00:45] ok! [13:00:53] I will come up with another patch after this, so please wait for a while :) [13:02:04] koi: i see Timo reviewed it (and looks he didn't yet have the chance to look at the changes you made since then). I prefer waiting for his opinion (and +1/-1). 
[13:02:11] RECOVERY - Check systemd state on ms-be1066 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:02:27] (03PS1) 10Volans: doc: set default language [software/homer] - 10https://gerrit.wikimedia.org/r/801387 [13:02:39] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mathoid: apply [13:02:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:01] hmm ok, will postpone till receiving a response [13:03:22] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mathoid: apply [13:03:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:38] !log upload pdns-recursor_4.6.2-1wm1 to apt.wm.o (bullseye) - T305589 [13:03:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:43] T305589: Upgrading Wikidough and durum VMs to bullseye - https://phabricator.wikimedia.org/T305589 [13:04:12] (03PS1) 10Jbond: naggen2: inject # page alias for critical hosts [puppet] - 10https://gerrit.wikimedia.org/r/801388 (https://phabricator.wikimedia.org/T236379) [13:04:53] (03CR) 10CI reject: [V: 04-1] redfish: Assume all GET and HEAD requests are RO [software/spicerack] - 10https://gerrit.wikimedia.org/r/800676 (owner: 10Jbond) [13:05:32] thanks koi [13:05:37] anything else? 
[13:05:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 25%: After migrating it to 10.6', diff saved to https://phabricator.wikimedia.org/P28964 and previous config saved to /var/cache/conftool/dbconfig/20220530-130545-root.json [13:05:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:54] (03CR) 10Volans: redfish: Assume all GET and HEAD requests are RO (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/800676 (owner: 10Jbond) [13:06:08] nope, another patch of mine depends on this 0_o [13:06:11] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35617/console" [puppet] - 10https://gerrit.wikimedia.org/r/801388 (https://phabricator.wikimedia.org/T236379) (owner: 10Jbond) [13:06:13] (03PS1) 10Volans: doc: set default language [software/cumin] - 10https://gerrit.wikimedia.org/r/801389 [13:06:26] (03PS1) 10Volans: doc: set default language [software/pywmflib] - 10https://gerrit.wikimedia.org/r/801390 [13:06:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [13:06:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [13:06:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T60674)', diff saved to https://phabricator.wikimedia.org/P28965 and previous config saved to /var/cache/conftool/dbconfig/20220530-130642-ladsgroup.json [13:06:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:47] (03PS3) 10Jbond: redfish: Assume all GET and HEAD requests are RO [software/spicerack] - 10https://gerrit.wikimedia.org/r/800676 [13:06:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:50] T60674: 
Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [13:07:18] koi: okay, in that case, we're done i think :). see you later! [13:07:30] see you! [13:07:33] (03CR) 10Jbond: redfish: Assume all GET and HEAD requests are RO (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/800676 (owner: 10Jbond) [13:08:16] (03CR) 10Alexandros Kosiaris: [C: 03+2] mathoid: pipeline bot promote (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/792625 (owner: 10PipelineBot) [13:08:46] (03CR) 10Volans: [C: 03+1] "I didn't tested it but LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/801388 (https://phabricator.wikimedia.org/T236379) (owner: 10Jbond) [13:09:19] (03Merged) 10jenkins-bot: doc: set default language [software/spicerack] - 10https://gerrit.wikimedia.org/r/801375 (owner: 10Volans) [13:10:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T60674)', diff saved to https://phabricator.wikimedia.org/P28966 and previous config saved to /var/cache/conftool/dbconfig/20220530-131003-ladsgroup.json [13:10:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P28967 and previous config saved to /var/cache/conftool/dbconfig/20220530-131044-ladsgroup.json [13:10:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T309311)', diff saved to https://phabricator.wikimedia.org/P28968 and previous config saved to /var/cache/conftool/dbconfig/20220530-131049-ladsgroup.json [13:10:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1178.eqiad.wmnet with reason: Maintenance [13:10:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 6:00:00 on db1178.eqiad.wmnet with reason: Maintenance [13:10:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:55] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [13:10:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1178 (T309311)', diff saved to https://phabricator.wikimedia.org/P28969 and previous config saved to /var/cache/conftool/dbconfig/20220530-131057-ladsgroup.json [13:10:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:01] (03PS2) 10Muehlenhoff: motd/kmod/debconf: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/799312 (https://phabricator.wikimedia.org/T308013) [13:13:44] (03CR) 10CI reject: [V: 04-1] doc: set default language [software/cumin] - 10https://gerrit.wikimedia.org/r/801389 (owner: 10Volans) [13:14:05] (03CR) 10Ayounsi: [C: 03+1] doc: set default language [software/homer] - 10https://gerrit.wikimedia.org/r/801387 (owner: 10Volans) [13:14:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T60674)', diff saved to https://phabricator.wikimedia.org/P28970 and previous config saved to /var/cache/conftool/dbconfig/20220530-131419-ladsgroup.json [13:14:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:27] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [13:14:28] !log upload dnsdist_1.7.1-1wm1 to apt.wm.o (bullseye) - T305589 [13:14:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:34] T305589: Upgrading Wikidough and durum VMs to bullseye - https://phabricator.wikimedia.org/T305589 [13:14:38] 10SRE: Requesting access to deployment for TheresNoTime 
- https://phabricator.wikimedia.org/T302231 (10MoritzMuehlenhoff) I'm temporarily removing the SRE-Access-Requests tag, so that this doesn't show up on our Clinic Duty workboard. When this is good to move forward, please simply re-add it. [13:14:48] 10SRE: Requesting access to deployment for TheresNoTime - https://phabricator.wikimedia.org/T302231 (10MoritzMuehlenhoff) [13:15:08] 10SRE, 10Security: Cookbook to reboot cassandra nodes - https://phabricator.wikimedia.org/T288975 (10MoritzMuehlenhoff) p:05Triage→03Medium [13:15:49] 10SRE, 10SRE-OnFire, 10Release-Engineering-Team, 10Sustainability: Remove old scap repositories from deploy1002 - https://phabricator.wikimedia.org/T309162 (10MoritzMuehlenhoff) p:05Triage→03Medium [13:16:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T309311)', diff saved to https://phabricator.wikimedia.org/P28971 and previous config saved to /var/cache/conftool/dbconfig/20220530-131644-ladsgroup.json [13:16:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:51] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [13:20:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 40%: After migrating it to 10.6', diff saved to https://phabricator.wikimedia.org/P28972 and previous config saved to /var/cache/conftool/dbconfig/20220530-132049-root.json [13:20:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P28973 and previous config saved to /var/cache/conftool/dbconfig/20220530-132510-ladsgroup.json [13:25:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to 
https://phabricator.wikimedia.org/P28974 and previous config saved to /var/cache/conftool/dbconfig/20220530-132549-ladsgroup.json [13:25:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:55] RECOVERY - Memcached on idp-test1002 is OK: TCP OK - 0.002 second response time on 208.80.154.72 port 11000 https://wikitech.wikimedia.org/wiki/Memcached [13:26:55] RECOVERY - Check systemd state on idp-test1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:27:25] (03CR) 10Elukey: [C: 03+2] ml-services: add frwikisource & frwiki articlequality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/801381 (https://phabricator.wikimedia.org/T307418) (owner: 10Kevin Bazira) [13:27:29] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp-test1002.wikimedia.org [13:27:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P28975 and previous config saved to /var/cache/conftool/dbconfig/20220530-132925-ladsgroup.json [13:29:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp-test1002.wikimedia.org [13:29:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:49] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp-test2002.wikimedia.org [13:30:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P28976 and previous config saved to /var/cache/conftool/dbconfig/20220530-133149-ladsgroup.json [13:31:54] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:56] !log kevinbazira@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [13:32:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:39] !log kevinbazira@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [13:32:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T60674)', diff saved to https://phabricator.wikimedia.org/P28977 and previous config saved to /var/cache/conftool/dbconfig/20220530-133524-ladsgroup.json [13:35:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:31] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [13:35:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 50%: After migrating it to 10.6', diff saved to https://phabricator.wikimedia.org/P28978 and previous config saved to /var/cache/conftool/dbconfig/20220530-133553-root.json [13:35:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:39] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host idp-test2002.wikimedia.org [13:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P28979 and previous config saved to /var/cache/conftool/dbconfig/20220530-134015-ladsgroup.json [13:40:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:19] PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:40:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T60674)', diff saved to https://phabricator.wikimedia.org/P28980 and previous config saved to /var/cache/conftool/dbconfig/20220530-134054-ladsgroup.json [13:40:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [13:41:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [13:41:01] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [13:41:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 10 hosts with reason: Maintenance [13:41:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 10 hosts with reason: Maintenance [13:41:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:29] 10SRE, 10Cloud-Services, 10Datasets-General-or-Unknown, 10affects-Kiwix-and-openZIM: Mirror more Kiwix downloads directories - https://phabricator.wikimedia.org/T57503 (10ArielGlenn) I wouldn't assign it to a specific person there. But you could maybe ping @nskaggs to raise awareness (team manager). Or I c... 
[13:44:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P28981 and previous config saved to /var/cache/conftool/dbconfig/20220530-134430-ladsgroup.json [13:44:33] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban, 10Traffic, 10User-zeljkofilipin: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (10akosiaris) >>! In T306181#7963731, @phuedx wrote: >>>! In T306181#7914450, @akosiaris wrote... [13:44:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P28982 and previous config saved to /var/cache/conftool/dbconfig/20220530-134654-ladsgroup.json [13:46:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P28983 and previous config saved to /var/cache/conftool/dbconfig/20220530-135029-ladsgroup.json [13:50:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 60%: After migrating it to 10.6', diff saved to https://phabricator.wikimedia.org/P28984 and previous config saved to /var/cache/conftool/dbconfig/20220530-135057-root.json [13:51:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:13] PROBLEM - Check systemd state on netbox2002 is CRITICAL: CRITICAL - degraded: The following units failed: rq-netbox.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:55:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T60674)', diff saved to 
https://phabricator.wikimedia.org/P28985 and previous config saved to /var/cache/conftool/dbconfig/20220530-135520-ladsgroup.json [13:55:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1149.eqiad.wmnet with reason: Maintenance [13:55:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1149.eqiad.wmnet with reason: Maintenance [13:55:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:27] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [13:55:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1149 (T60674)', diff saved to https://phabricator.wikimedia.org/P28986 and previous config saved to /var/cache/conftool/dbconfig/20220530-135528-ladsgroup.json [13:55:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [13:56:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [13:56:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T60674)', diff saved to https://phabricator.wikimedia.org/P28987 and previous config saved to /var/cache/conftool/dbconfig/20220530-135654-ladsgroup.json [13:56:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:29] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 
verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:59:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T60674)', diff saved to https://phabricator.wikimedia.org/P28988 and previous config saved to /var/cache/conftool/dbconfig/20220530-135935-ladsgroup.json [13:59:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [13:59:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [13:59:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T309311)', diff saved to https://phabricator.wikimedia.org/P28989 and previous config saved to /var/cache/conftool/dbconfig/20220530-140159-ladsgroup.json [14:02:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1126.eqiad.wmnet with reason: Maintenance [14:02:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1126.eqiad.wmnet with reason: Maintenance [14:02:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:07] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [14:02:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1126 (T309311)', diff saved to https://phabricator.wikimedia.org/P28990 and previous config saved to /var/cache/conftool/dbconfig/20220530-140207-ladsgroup.json [14:02:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:15] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:03] (03CR) 10Filippo Giunchedi: "Untested but LGTM, see inline for nit/question" [puppet] - 10https://gerrit.wikimedia.org/r/801388 (https://phabricator.wikimedia.org/T236379) (owner: 10Jbond) [14:05:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P28991 and previous config saved to /var/cache/conftool/dbconfig/20220530-140534-ladsgroup.json [14:05:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Replace labstore100[67] with clouddumps100[12] - https://phabricator.wikimedia.org/T309346 (10ArielGlenn) a:05ArielGlenn→03None Not sure who should get this next but it's not Hannah or I :-) I was never involved in the configura... [14:06:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 75%: After migrating it to 10.6', diff saved to https://phabricator.wikimedia.org/P28992 and previous config saved to /var/cache/conftool/dbconfig/20220530-140601-root.json [14:06:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:00] (03CR) 10Nikerabbit: [C: 03+1] testwiki: Enable Section Translation in 10 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/800833 (https://phabricator.wikimedia.org/T308829) (owner: 10KartikMistry) [14:08:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126 (T309311)', diff saved to https://phabricator.wikimedia.org/P28993 and previous config saved to /var/cache/conftool/dbconfig/20220530-140800-ladsgroup.json [14:08:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:06] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 
[14:09:43] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:13:11] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [14:13:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1132.eqiad.wmnet with reason: Maintenance [14:13:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1132.eqiad.wmnet with reason: Maintenance [14:13:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1132 (T60674)', diff saved to https://phabricator.wikimedia.org/P28994 and previous config saved to /var/cache/conftool/dbconfig/20220530-141320-ladsgroup.json [14:13:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:31] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [14:20:03] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [14:20:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T60674)', diff saved to https://phabricator.wikimedia.org/P28995 and previous config saved to /var/cache/conftool/dbconfig/20220530-142039-ladsgroup.json [14:20:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [14:20:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [14:20:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:45] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [14:20:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T60674)', diff saved to https://phabricator.wikimedia.org/P28996 and previous config saved to /var/cache/conftool/dbconfig/20220530-142047-ladsgroup.json [14:20:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T60674)', diff saved to https://phabricator.wikimedia.org/P28997 and previous config saved to /var/cache/conftool/dbconfig/20220530-142057-ladsgroup.json [14:20:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 100%: After migrating it to 10.6', diff saved to https://phabricator.wikimedia.org/P28998 and previous config saved to /var/cache/conftool/dbconfig/20220530-142105-root.json [14:21:08] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126', diff saved to https://phabricator.wikimedia.org/P28999 and previous config saved to /var/cache/conftool/dbconfig/20220530-142305-ladsgroup.json [14:23:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:46] <_joe_> jouncebot: next [14:24:46] In 1 hour(s) and 5 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220530T1530) [14:27:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T60674)', diff saved to https://phabricator.wikimedia.org/P29001 and previous config saved to /var/cache/conftool/dbconfig/20220530-142726-ladsgroup.json [14:27:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:34] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [14:28:06] (03PS2) 10Giuseppe Lavagetto: mediawiki_canaries: disable opcache revalidation [puppet] - 10https://gerrit.wikimedia.org/r/792983 (https://phabricator.wikimedia.org/T266055) [14:28:37] PROBLEM - Check that envoy is running on idp-test2002 is CRITICAL: CRITICAL - Expecting active but unit envoyproxy.service is inactive https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [14:29:33] <_joe_> jbond: I should ignore idp-test2002, right? 
[14:30:18] (ProbeDown) firing: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:30:38] _joe_: yes i think so but check with moritzm [14:31:28] (03CR) 10Volans: [C: 03+1] "reply inline" [puppet] - 10https://gerrit.wikimedia.org/r/801388 (https://phabricator.wikimedia.org/T236379) (owner: 10Jbond) [14:31:33] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: [WIP] Requesting access to deployment group for TThoabala - https://phabricator.wikimedia.org/T303398 (10TThoabala) @herron / @Dzahn I am back from leave and I would like to carry on with this ticket. I generated a new key and updated it on the ticket. [14:31:37] (03CR) 10Volans: [C: 03+2] doc: set default language [software/homer] - 10https://gerrit.wikimedia.org/r/801387 (owner: 10Volans) [14:32:13] (03CR) 10Volans: [C: 03+2] "Equivalent of I521fcd9e5ac36f0af2ad3f7d6382ea305994ff9d, self-merging." [software/pywmflib] - 10https://gerrit.wikimedia.org/r/801390 (owner: 10Volans) [14:32:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T60674)', diff saved to https://phabricator.wikimedia.org/P29002 and previous config saved to /var/cache/conftool/dbconfig/20220530-143238-ladsgroup.json [14:32:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:46] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [14:34:50] _joe_: yeah, ignore. 
I'm currently fighting with envoy there [14:35:04] <_joe_> moritzm: once envoy won, I'm happy to help [14:35:18] (ProbeDown) resolved: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:35:19] <_joe_> but I won't spoil your moment of envoy zen [14:35:30] (03Merged) 10jenkins-bot: doc: set default language [software/homer] - 10https://gerrit.wikimedia.org/r/801387 (owner: 10Volans) [14:36:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P29003 and previous config saved to /var/cache/conftool/dbconfig/20220530-143602-ladsgroup.json [14:36:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:19] <_joe_> the trick there is to admit you're powerless in front of that wall of yaml/protobuf nightmare :P [14:36:22] (03Merged) 10jenkins-bot: doc: set default language [software/pywmflib] - 10https://gerrit.wikimedia.org/r/801390 (owner: 10Volans) [14:36:33] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki_canaries: disable opcache revalidation [puppet] - 10https://gerrit.wikimedia.org/r/792983 (https://phabricator.wikimedia.org/T266055) (owner: 10Giuseppe Lavagetto) [14:36:51] hehe :-) [14:38:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126', diff saved to https://phabricator.wikimedia.org/P29004 and previous config saved to /var/cache/conftool/dbconfig/20220530-143810-ladsgroup.json [14:38:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:37] (03CR) 10Ottomata: [C: 03+1] Stream config for android breadcrumbs schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801018 (owner: 10Sharvaniharan) [14:40:06] (03CR) 10Jbond: [V: 03+1] 
naggen2: inject # page alias for critical hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/801388 (https://phabricator.wikimedia.org/T236379) (owner: 10Jbond) [14:40:07] RECOVERY - Check that envoy is running on idp-test2002 is OK: OK - envoyproxy.service is active https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [14:41:19] (ProbeDown) firing: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:41:41] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/800676 (owner: 10Jbond) [14:42:15] (03PS14) 10Filippo Giunchedi: prometheus::blackbox::check: add new blackbox exporter check [puppet] - 10https://gerrit.wikimedia.org/r/787067 (owner: 10Jbond) [14:42:18] !log installing clamav security updates on otrs1001 [14:42:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P29005 and previous config saved to /var/cache/conftool/dbconfig/20220530-144232-ladsgroup.json [14:42:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:12] 10SRE, 10Infrastructure-Foundations, 10netops: DHCPd: update config to log more info - https://phabricator.wikimedia.org/T309524 (10Volans) IIRC that hostname is evaluated by the DHCP at restart time and then the resulting IP is used in the configuration. Because that's a valid hostname in our DNS it would h... [14:45:17] (03CR) 10Filippo Giunchedi: "I played a little with this today and uploaded a PS that works in Pontoon." 
[puppet] - 10https://gerrit.wikimedia.org/r/787067 (owner: 10Jbond) [14:46:18] (ProbeDown) resolved: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:46:27] (03CR) 10Filippo Giunchedi: [C: 03+1] naggen2: inject # page alias for critical hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/801388 (https://phabricator.wikimedia.org/T236379) (owner: 10Jbond) [14:47:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P29006 and previous config saved to /var/cache/conftool/dbconfig/20220530-144743-ladsgroup.json [14:47:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:29] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban, 10Traffic, and 2 others: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (10BTullis) Thanks @phuedx and @akosiaris for that information and for the patch. That's a great find ab... 
[14:50:27] (03CR) 10JMeybohm: [C: 03+1] "Thanks" [puppet] - 10https://gerrit.wikimedia.org/r/799308 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [14:51:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P29007 and previous config saved to /var/cache/conftool/dbconfig/20220530-145107-ladsgroup.json [14:51:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126 (T309311)', diff saved to https://phabricator.wikimedia.org/P29008 and previous config saved to /var/cache/conftool/dbconfig/20220530-145315-ladsgroup.json [14:53:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1114.eqiad.wmnet with reason: Maintenance [14:53:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1114.eqiad.wmnet with reason: Maintenance [14:53:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:22] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [14:53:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1114 (T309311)', diff saved to https://phabricator.wikimedia.org/P29009 and previous config saved to /var/cache/conftool/dbconfig/20220530-145323-ladsgroup.json [14:53:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:12] (03PS1) 10Jbond: admin: add tsepothoabala [puppet] - 10https://gerrit.wikimedia.org/r/801400 (https://phabricator.wikimedia.org/T303398) [14:55:47] (03Abandoned) 10Jbond: admin: add tsepothoabala to deployment [puppet] - 10https://gerrit.wikimedia.org/r/772823 
(https://phabricator.wikimedia.org/T303398) (owner: 10Jbond) [14:57:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P29010 and previous config saved to /var/cache/conftool/dbconfig/20220530-145737-ladsgroup.json [14:57:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T60674)', diff saved to https://phabricator.wikimedia.org/P29011 and previous config saved to /var/cache/conftool/dbconfig/20220530-145756-ladsgroup.json [14:58:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:02] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [14:59:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114 (T309311)', diff saved to https://phabricator.wikimedia.org/P29012 and previous config saved to /var/cache/conftool/dbconfig/20220530-145919-ladsgroup.json [14:59:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:26] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [15:00:36] (03CR) 10Jbond: [C: 03+2] admin: add tsepothoabala [puppet] - 10https://gerrit.wikimedia.org/r/801400 (https://phabricator.wikimedia.org/T303398) (owner: 10Jbond) [15:02:36] (03PS1) 10Jbond: admin: add correct key for tthoabala [puppet] - 10https://gerrit.wikimedia.org/r/801401 (https://phabricator.wikimedia.org/T303398) [15:02:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P29013 and previous config saved to /var/cache/conftool/dbconfig/20220530-150248-ladsgroup.json [15:02:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:37] (03CR) 10Jbond: [C: 03+2] admin: add 
correct key for tthoabala [puppet] - 10https://gerrit.wikimedia.org/r/801401 (https://phabricator.wikimedia.org/T303398) (owner: 10Jbond) [15:06:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T60674)', diff saved to https://phabricator.wikimedia.org/P29014 and previous config saved to /var/cache/conftool/dbconfig/20220530-150612-ladsgroup.json [15:06:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [15:06:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [15:06:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:19] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [15:06:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:53] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:11:16] (03CR) 10Tchanders: [C: 04-1] Assign similareditors right to the checkuser group (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799378 (https://phabricator.wikimedia.org/T307205) (owner: 10AGueyte) [15:11:35] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: [WIP] Requesting access to deployment group for TThoabala - https://phabricator.wikimedia.org/T303398 (10jbond) 05Open→03Resolved a:03jbond @TThoabala this has been merged, please re-open this task if you still have issues thanks [15:11:48] (03CR) 10Jbond: [C: 03+2] redfish: Assume all GET and HEAD requests are RO [software/spicerack] - 10https://gerrit.wikimedia.org/r/800676 (owner: 10Jbond) [15:12:31] (03CR) 10Tchanders: [C: 03+1] Deploy SimilarEditors to the beta cluster 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/799012 (https://phabricator.wikimedia.org/T306908) (owner: 10AGueyte) [15:12:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T60674)', diff saved to https://phabricator.wikimedia.org/P29015 and previous config saved to /var/cache/conftool/dbconfig/20220530-151242-ladsgroup.json [15:12:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [15:12:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [15:12:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:50] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [15:12:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T60674)', diff saved to https://phabricator.wikimedia.org/P29016 and previous config saved to /var/cache/conftool/dbconfig/20220530-151251-ladsgroup.json [15:12:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P29017 and previous config saved to /var/cache/conftool/dbconfig/20220530-151301-ladsgroup.json [15:13:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114', diff saved to https://phabricator.wikimedia.org/P29018 and previous config saved to /var/cache/conftool/dbconfig/20220530-151424-ladsgroup.json [15:14:29] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:35] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10Gehel) a:03Gehel [15:16:46] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10Gehel) a:05Gehel→03None [15:16:57] 10SRE, 10Discovery-Search (Current work), 10Patch-For-Review: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10Gehel) a:03bking [15:17:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T60674)', diff saved to https://phabricator.wikimedia.org/P29019 and previous config saved to /var/cache/conftool/dbconfig/20220530-151753-ladsgroup.json [15:17:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1148.eqiad.wmnet with reason: Maintenance [15:17:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1148.eqiad.wmnet with reason: Maintenance [15:17:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:00] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [15:18:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1148 (T60674)', diff saved to https://phabricator.wikimedia.org/P29020 and previous config saved to /var/cache/conftool/dbconfig/20220530-151801-ladsgroup.json [15:18:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T60674)', diff saved to 
https://phabricator.wikimedia.org/P29021 and previous config saved to /var/cache/conftool/dbconfig/20220530-151932-ladsgroup.json [15:19:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:07] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to mwmaint1002.eqiad.wmnet for sgimeno - https://phabricator.wikimedia.org/T309045 (10Sgs) @thcipriani is there any impediment with my request? Ty. [15:20:18] jouncebot: nowandnext [15:20:18] No deployments scheduled for the next 0 hour(s) and 9 minute(s) [15:20:18] In 0 hour(s) and 9 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220530T1530) [15:20:21] PROBLEM - SSH on restbase1018.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:21:18] (03Merged) 10jenkins-bot: redfish: Assume all GET and HEAD requests are RO [software/spicerack] - 10https://gerrit.wikimedia.org/r/800676 (owner: 10Jbond) [15:21:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [15:21:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [15:22:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T60674)', diff saved to https://phabricator.wikimedia.org/P29022 and previous config saved to /var/cache/conftool/dbconfig/20220530-152202-ladsgroup.json [15:22:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:12] (03PS1) 10Muehlenhoff: idp-test: Point to the new Bullseye hosts [puppet] - 10https://gerrit.wikimedia.org/r/801402 (https://phabricator.wikimedia.org/T308214) [15:26:44] 
(03CR) 10Muehlenhoff: [C: 03+2] dragonfly: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/799308 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [15:26:50] (03PS2) 10Muehlenhoff: dragonfly: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/799308 (https://phabricator.wikimedia.org/T308013) [15:28:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P29023 and previous config saved to /var/cache/conftool/dbconfig/20220530-152806-ladsgroup.json [15:28:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114', diff saved to https://phabricator.wikimedia.org/P29024 and previous config saved to /var/cache/conftool/dbconfig/20220530-152929-ladsgroup.json [15:29:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:04] jan_drewniak: #bothumor I � Unicode. All rise for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220530T1530). 
[15:33:08] (03PS1) 10Stang: enwiki: Regenerate inconsistent logo-1x [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801403 (https://phabricator.wikimedia.org/T309544) [15:34:25] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:34:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P29025 and previous config saved to /var/cache/conftool/dbconfig/20220530-153437-ladsgroup.json [15:34:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:14] (03CR) 10Muehlenhoff: [C: 03+2] motd/kmod/debconf: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/799312 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [15:35:30] (03CR) 10Physikerwelt: mathoid: pipeline bot promote (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/792625 (owner: 10PipelineBot) [15:37:26] (03PS4) 10Hnowlan: Set production role and add config for restbase2027 [puppet] - 10https://gerrit.wikimedia.org/r/779846 [15:38:28] (03PS1) 10Lucas Werkmeister (WMDE): Refresh English Wikipedia logo file (enwiki.png) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801405 (https://phabricator.wikimedia.org/T309544) [15:39:27] (03CR) 10Lucas Werkmeister (WMDE): Refresh English Wikipedia logo file (enwiki.png) (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801405 (https://phabricator.wikimedia.org/T309544) (owner: 10Lucas Werkmeister (WMDE)) [15:41:27] (03PS2) 10Lucas Werkmeister (WMDE): Refresh English Wikipedia logo file (enwiki.png) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801405 (https://phabricator.wikimedia.org/T309544) [15:42:59] (03CR) 10Lucas Werkmeister (WMDE): enwiki: Regenerate inconsistent logo-1x (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801403 
(https://phabricator.wikimedia.org/T309544) (owner: 10Stang) [15:43:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T60674)', diff saved to https://phabricator.wikimedia.org/P29026 and previous config saved to /var/cache/conftool/dbconfig/20220530-154311-ladsgroup.json [15:43:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance [15:43:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance [15:43:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:18] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [15:43:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T60674)', diff saved to https://phabricator.wikimedia.org/P29027 and previous config saved to /var/cache/conftool/dbconfig/20220530-154319-ladsgroup.json [15:43:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114 (T309311)', diff saved to https://phabricator.wikimedia.org/P29028 and previous config saved to /var/cache/conftool/dbconfig/20220530-154434-ladsgroup.json [15:44:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [15:44:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [15:44:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:40] T309311: Make user_editcount unsigned in production - 
https://phabricator.wikimedia.org/T309311 [15:44:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:58] (03PS3) 10Tchanders: Add QuickSurveys survey for the SimilarEditors feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787793 (https://phabricator.wikimedia.org/T307025) [15:48:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [15:48:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [15:48:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3318 (T309311)', diff saved to https://phabricator.wikimedia.org/P29029 and previous config saved to /var/cache/conftool/dbconfig/20220530-154830-ladsgroup.json [15:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P29030 and previous config saved to /var/cache/conftool/dbconfig/20220530-154942-ladsgroup.json [15:49:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318 (T309311)', diff saved to https://phabricator.wikimedia.org/P29031 and previous config saved to /var/cache/conftool/dbconfig/20220530-155441-ladsgroup.json [15:54:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:47] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [15:57:36] !log ladsgroup@cumin1001 
dbctl commit (dc=all): 'Repooling after maintenance db1148 (T60674)', diff saved to https://phabricator.wikimedia.org/P29032 and previous config saved to /var/cache/conftool/dbconfig/20220530-155735-ladsgroup.json [15:57:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:41] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [15:59:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T60674)', diff saved to https://phabricator.wikimedia.org/P29033 and previous config saved to /var/cache/conftool/dbconfig/20220530-155933-ladsgroup.json [15:59:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:49] (03CR) 10Hnowlan: [C: 03+2] Set production role and add config for restbase2027 [puppet] - 10https://gerrit.wikimedia.org/r/779846 (owner: 10Hnowlan) [16:04:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T60674)', diff saved to https://phabricator.wikimedia.org/P29034 and previous config saved to /var/cache/conftool/dbconfig/20220530-160447-ladsgroup.json [16:04:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:53] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [16:07:03] (03CR) 10Alexandros Kosiaris: [C: 03+2] mathoid: pipeline bot promote (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/792625 (owner: 10PipelineBot) [16:09:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318', diff saved to https://phabricator.wikimedia.org/P29035 and previous config saved to /var/cache/conftool/dbconfig/20220530-160946-ladsgroup.json [16:09:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:01] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2027.codfw.wmnet with OS buster 
[16:12:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P29036 and previous config saved to /var/cache/conftool/dbconfig/20220530-161240-ladsgroup.json [16:12:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:18] (03PS1) 10Alexandros Kosiaris: eventgate-analytics: Bump 2022-05-30-145633-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/801409 (https://phabricator.wikimedia.org/T306181) [16:14:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P29037 and previous config saved to /var/cache/conftool/dbconfig/20220530-161438-ladsgroup.json [16:14:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:28] (03PS2) 10Andrew Bogott: Magnum: misc config fixes [puppet] - 10https://gerrit.wikimedia.org/r/801016 [16:19:30] (03PS1) 10Andrew Bogott: Add 'region' arg to heat and magnum manifests [puppet] - 10https://gerrit.wikimedia.org/r/801410 [16:20:30] (03CR) 10Physikerwelt: mathoid: pipeline bot promote (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/792625 (owner: 10PipelineBot) [16:21:39] (03CR) 10Andrew Bogott: [C: 03+2] Magnum: misc config fixes [puppet] - 10https://gerrit.wikimedia.org/r/801016 (owner: 10Andrew Bogott) [16:22:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T60674)', diff saved to https://phabricator.wikimedia.org/P29038 and previous config saved to /var/cache/conftool/dbconfig/20220530-162218-ladsgroup.json [16:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:25] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [16:23:24] (03CR) 10Andrew Bogott: [C: 03+2] 
Add 'region' arg to heat and magnum manifests [puppet] - 10https://gerrit.wikimedia.org/r/801410 (owner: 10Andrew Bogott) [16:24:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318', diff saved to https://phabricator.wikimedia.org/P29039 and previous config saved to /var/cache/conftool/dbconfig/20220530-162451-ladsgroup.json [16:24:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P29040 and previous config saved to /var/cache/conftool/dbconfig/20220530-162745-ladsgroup.json [16:27:48] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase2027.codfw.wmnet with reason: host reimage [16:27:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P29041 and previous config saved to /var/cache/conftool/dbconfig/20220530-162943-ladsgroup.json [16:29:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:40] (03PS1) 10Volans: redfish: allow to submit tasks with DELETE [software/spicerack] - 10https://gerrit.wikimedia.org/r/801411 [16:30:52] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase2027.codfw.wmnet with reason: host reimage [16:30:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:28] RECOVERY - SSH on restbase1018.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:37:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to 
https://phabricator.wikimedia.org/P29042 and previous config saved to /var/cache/conftool/dbconfig/20220530-163723-ladsgroup.json [16:37:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318 (T309311)', diff saved to https://phabricator.wikimedia.org/P29043 and previous config saved to /var/cache/conftool/dbconfig/20220530-163957-ladsgroup.json [16:40:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2079.codfw.wmnet with reason: Maintenance [16:40:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:03] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [16:40:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2079.codfw.wmnet with reason: Maintenance [16:40:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 12 hosts with reason: Maintenance [16:40:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:12] (03CR) 10STran: "I'm fine to +1 this if we don't mind the absence of the privacy policy link." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/787793 (https://phabricator.wikimedia.org/T307025) (owner: 10Tchanders) [16:40:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 12 hosts with reason: Maintenance [16:40:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:42] PROBLEM - Check systemd state on ms-be2066 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-nic-firmware-textfile.service,prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:42:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T60674)', diff saved to https://phabricator.wikimedia.org/P29044 and previous config saved to /var/cache/conftool/dbconfig/20220530-164250-ladsgroup.json [16:42:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1142.eqiad.wmnet with reason: Maintenance [16:42:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1142.eqiad.wmnet with reason: Maintenance [16:42:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:57] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [16:42:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1142 (T60674)', diff saved to https://phabricator.wikimedia.org/P29045 and previous config saved to /var/cache/conftool/dbconfig/20220530-164258-ladsgroup.json [16:43:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:11] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:26] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:43:46] (03PS5) 10Jbond: WIP: Early start on firmware cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/763215 [16:44:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [16:44:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [16:44:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3318 (T309311)', diff saved to https://phabricator.wikimedia.org/P29046 and previous config saved to /var/cache/conftool/dbconfig/20220530-164423-ladsgroup.json [16:44:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T60674)', diff saved to https://phabricator.wikimedia.org/P29047 and previous config saved to /var/cache/conftool/dbconfig/20220530-164448-ladsgroup.json [16:44:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance [16:44:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance [16:44:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 13 hosts with reason: Maintenance [16:44:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:02] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 13 hosts with reason: Maintenance [16:45:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:32] (03CR) 10CI reject: [V: 04-1] WIP: Early start on firmware cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/763215 (owner: 10Jbond) [16:46:44] !log volans@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet [16:46:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:51] 10SRE-swift-storage, 10Infrastructure-Foundations: Poweredge R730xd, R740xd, R740xd2 SSDs not visible to OS as SSDs - https://phabricator.wikimedia.org/T309027 (10ops-monitoring-bot) Host rebooted by volans@cumin2002 with reason: Converted SSDs to non-RAID [16:49:20] ACKNOWLEDGEMENT - MD RAID on ms-be2066 is CRITICAL: CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T309553 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [16:49:23] 10SRE, 10ops-codfw: Degraded RAID on ms-be2066 - https://phabricator.wikimedia.org/T309553 (10ops-monitoring-bot) [16:50:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318 (T309311)', diff saved to https://phabricator.wikimedia.org/P29048 and previous config saved to /var/cache/conftool/dbconfig/20220530-165034-ladsgroup.json [16:50:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:41] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [16:51:12] (03CR) 10Jbond: [C: 03+1] redfish: allow to submit tasks with DELETE [software/spicerack] - 
10https://gerrit.wikimedia.org/r/801411 (owner: 10Volans) [16:51:46] 10SRE, 10ops-codfw: Degraded RAID on ms-be2066 - https://phabricator.wikimedia.org/T309553 (10Volans) Related to my tests on T309027 for a pre-production host. Ignore for now. I'll close it once I'm sure there are no issues. Sorry for the noise. [16:51:49] (03CR) 10Volans: [C: 03+2] redfish: allow to submit tasks with DELETE [software/spicerack] - 10https://gerrit.wikimedia.org/r/801411 (owner: 10Volans) [16:52:13] (03CR) 10Jbond: [C: 03+1] idp-test: Point to the new Bullseye hosts [puppet] - 10https://gerrit.wikimedia.org/r/801402 (https://phabricator.wikimedia.org/T308214) (owner: 10Muehlenhoff) [16:52:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P29049 and previous config saved to /var/cache/conftool/dbconfig/20220530-165228-ladsgroup.json [16:52:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:22] (03PS1) 10Volans: sre.hosts.reboot-single: hide cumin progress [cookbooks] - 10https://gerrit.wikimedia.org/r/801414 [16:56:36] (03CR) 10Stang: enwiki: Regenerate inconsistent logo-1x (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801403 (https://phabricator.wikimedia.org/T309544) (owner: 10Stang) [16:58:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [16:58:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [16:58:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3311 (T60674)', diff saved to https://phabricator.wikimedia.org/P29050 and previous config saved to /var/cache/conftool/dbconfig/20220530-165854-ladsgroup.json [16:58:58] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:03] (03Merged) 10jenkins-bot: redfish: allow to submit tasks with DELETE [software/spicerack] - 10https://gerrit.wikimedia.org/r/801411 (owner: 10Volans) [16:59:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:04] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [17:00:05] ryankemper: I, the Bot under the Fountain, call upon thee, The Deployer, to do Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220530T1700). [17:00:22] (03PS2) 10Stang: enwiki: Regenerate inconsistent logo-1x [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801403 (https://phabricator.wikimedia.org/T309544) [17:00:44] (03CR) 10Stang: enwiki: Regenerate inconsistent logo-1x (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801403 (https://phabricator.wikimedia.org/T309544) (owner: 10Stang) [17:01:14] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host restbase2027.codfw.wmnet with OS buster [17:01:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:30] !log adding restbase2027-a to cassandra cluster [17:02:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318', diff saved to https://phabricator.wikimedia.org/P29051 and previous config saved to /var/cache/conftool/dbconfig/20220530-170539-ladsgroup.json [17:05:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:19] PROBLEM - Host ms-be2066 is DOWN: PING CRITICAL - Packet loss = 100% [17:07:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T60674)', diff saved to https://phabricator.wikimedia.org/P29052 and previous config saved to 
/var/cache/conftool/dbconfig/20220530-170733-ladsgroup.json [17:07:34] that's me, but it should be downtimed... [17:07:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [17:07:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [17:07:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [17:07:39] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [17:07:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [17:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T60674)', diff saved to https://phabricator.wikimedia.org/P29053 and previous config saved to /var/cache/conftool/dbconfig/20220530-170747-ladsgroup.json [17:07:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:58] ah, just ended [17:07:59] volans: did you mean ms-be2066? [17:08:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:07] sukhe: yes ms-be2066 it's me [17:08:09] thank you [17:09:24] 10SRE, 10ops-codfw: Degraded RAID on ms-be2066 - https://phabricator.wikimedia.org/T309553 (10Volans) 05Open→03Resolved a:03Volans False alarm confirmed. [17:09:33] sorry for the noise [17:09:57] np please! 
[17:10:15] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [17:10:42] (03CR) 10STran: [C: 03+1] "Confirmed off-band we're okay with it." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787793 (https://phabricator.wikimedia.org/T307025) (owner: 10Tchanders) [17:11:33] !log volans@cumin2002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host ms-be2066.codfw.wmnet [17:11:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T60674)', diff saved to https://phabricator.wikimedia.org/P29054 and previous config saved to /var/cache/conftool/dbconfig/20220530-171422-ladsgroup.json [17:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:29] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [17:15:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T60674)', diff saved to https://phabricator.wikimedia.org/P29055 and previous config saved to /var/cache/conftool/dbconfig/20220530-171546-ladsgroup.json [17:15:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:01] 10SRE-swift-storage, 10Infrastructure-Foundations: Poweredge R730xd, R740xd, R740xd2 SSDs not visible to OS as SSDs - https://phabricator.wikimedia.org/T309027 (10Volans) I spoke with @MatthewVernon and he kindly gave me `ms-be2066` (pre-production host) to test the conversion of SSD disks from RAID PVs to non... [17:17:41] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [17:20:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318', diff saved to https://phabricator.wikimedia.org/P29056 and previous config saved to /var/cache/conftool/dbconfig/20220530-172044-ladsgroup.json [17:20:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:19] PROBLEM - SSH on druid1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:23:44] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10Paladox) [17:29:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P29057 and previous config saved to /var/cache/conftool/dbconfig/20220530-172927-ladsgroup.json [17:29:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P29058 and previous config saved to /var/cache/conftool/dbconfig/20220530-173051-ladsgroup.json [17:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:40] (03PS1) 10Muehlenhoff: Add Paladox to contributors [puppet] - 10https://gerrit.wikimedia.org/r/801415 [17:35:47] (03CR) 10Muehlenhoff: [C: 03+2] Add Paladox to contributors [puppet] - 10https://gerrit.wikimedia.org/r/801415 (owner: 10Muehlenhoff) [17:35:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318 (T309311)', diff saved to https://phabricator.wikimedia.org/P29059 and previous config saved to /var/cache/conftool/dbconfig/20220530-173549-ladsgroup.json [17:35:52] !log 
ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1109.eqiad.wmnet with reason: Maintenance [17:35:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1109.eqiad.wmnet with reason: Maintenance [17:35:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:57] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [17:35:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1109 (T309311)', diff saved to https://phabricator.wikimedia.org/P29060 and previous config saved to /var/cache/conftool/dbconfig/20220530-173558-ladsgroup.json [17:36:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:43] tgr_: hi, you're connected to a now depooled db in mwmaint (in eswiki), can you restart the session? 
[17:39:50] for a while now [17:40:26] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/801414 (owner: 10Volans) [17:41:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1109 (T309311)', diff saved to https://phabricator.wikimedia.org/P29061 and previous config saved to /var/cache/conftool/dbconfig/20220530-174157-ladsgroup.json [17:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:04] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [17:42:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Replace labstore100[67] with clouddumps100[12] - https://phabricator.wikimedia.org/T309346 (10MoritzMuehlenhoff) p:05Triage→03Medium [17:43:59] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Wikimedia-Incident: 2022-05-09 Exim BDAT Errors incident - https://phabricator.wikimedia.org/T309238 (10MoritzMuehlenhoff) p:05Triage→03Medium [17:44:31] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:44:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P29062 and previous config saved to /var/cache/conftool/dbconfig/20220530-174432-ladsgroup.json [17:44:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P29063 and previous config saved to /var/cache/conftool/dbconfig/20220530-174556-ladsgroup.json [17:46:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T60674)', diff saved to https://phabricator.wikimedia.org/P29064 
and previous config saved to /var/cache/conftool/dbconfig/20220530-175403-ladsgroup.json [17:54:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:11] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [17:56:16] why is there no UTC late backport window today? [17:57:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1109', diff saved to https://phabricator.wikimedia.org/P29065 and previous config saved to /var/cache/conftool/dbconfig/20220530-175702-ladsgroup.json [17:57:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [17:58:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [17:58:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T60674)', diff saved to https://phabricator.wikimedia.org/P29066 and previous config saved to /var/cache/conftool/dbconfig/20220530-175937-ladsgroup.json [17:59:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [17:59:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [17:59:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:43] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [17:59:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:52] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T60674)', diff saved to https://phabricator.wikimedia.org/P29067 and previous config saved to /var/cache/conftool/dbconfig/20220530-180101-ladsgroup.json [18:01:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [18:01:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [18:01:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:35] (FrontendUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [18:01:59] here [18:02:00] checking [18:02:00] <_joe_> wat [18:02:20] it's a weird one [18:02:24] checking RED [18:02:33] <_joe_> I can see the sites [18:04:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance [18:04:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance [18:04:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1131 (T60674)', diff saved to https://phabricator.wikimedia.org/P29068 and previous config saved to /var/cache/conftool/dbconfig/20220530-180418-ladsgroup.json [18:04:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:26] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:50] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10Matanya) [18:06:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T60674)', diff saved to https://phabricator.wikimedia.org/P29069 and previous config saved to /var/cache/conftool/dbconfig/20220530-180625-ladsgroup.json [18:06:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:32] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [18:06:35] (FrontendUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [18:09:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P29070 and previous config saved to /var/cache/conftool/dbconfig/20220530-180908-ladsgroup.json [18:09:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1109', diff saved to https://phabricator.wikimedia.org/P29071 and previous config saved to /var/cache/conftool/dbconfig/20220530-181207-ladsgroup.json [18:12:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [18:14:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [18:15:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 
'Depooling db1099:3311 (T60674)', diff saved to https://phabricator.wikimedia.org/P29072 and previous config saved to /var/cache/conftool/dbconfig/20220530-181503-ladsgroup.json [18:15:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:13] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [18:20:16] (03PS1) 10Muehlenhoff: Add Matanya to contributors [puppet] - 10https://gerrit.wikimedia.org/r/801417 [18:21:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P29073 and previous config saved to /var/cache/conftool/dbconfig/20220530-182130-ladsgroup.json [18:21:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:44] (03CR) 10Muehlenhoff: [C: 03+2] Add Matanya to contributors [puppet] - 10https://gerrit.wikimedia.org/r/801417 (owner: 10Muehlenhoff) [18:24:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P29074 and previous config saved to /var/cache/conftool/dbconfig/20220530-182413-ladsgroup.json [18:24:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1109 (T309311)', diff saved to https://phabricator.wikimedia.org/P29075 and previous config saved to /var/cache/conftool/dbconfig/20220530-182712-ladsgroup.json [18:27:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:18] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [18:29:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1141.eqiad.wmnet with reason: Maintenance [18:29:16] !log ladsgroup@cumin1001 END 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1141.eqiad.wmnet with reason: Maintenance [18:29:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1141 (T60674)', diff saved to https://phabricator.wikimedia.org/P29076 and previous config saved to /var/cache/conftool/dbconfig/20220530-182920-ladsgroup.json [18:29:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:30] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [18:31:09] PROBLEM - cassandra-c CQL 10.192.48.184:9042 on restbase2027 is CRITICAL: connect to address 10.192.48.184 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [18:31:09] PROBLEM - cassandra-a CQL 10.192.48.182:9042 on restbase2027 is CRITICAL: connect to address 10.192.48.182 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [18:31:09] PROBLEM - cassandra-b CQL 10.192.48.183:9042 on restbase2027 is CRITICAL: connect to address 10.192.48.183 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [18:31:13] PROBLEM - cassandra-b SSL 10.192.48.183:7001 on restbase2027 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [18:31:13] PROBLEM - cassandra-c SSL 10.192.48.184:7001 on restbase2027 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [18:31:25] PROBLEM - cassandra-b service on restbase2027 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:31:25] PROBLEM - 
cassandra-c service on restbase2027 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:32:20] oh oh [18:32:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T60674)', diff saved to https://phabricator.wikimedia.org/P29077 and previous config saved to /var/cache/conftool/dbconfig/20220530-183245-ladsgroup.json [18:32:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P29078 and previous config saved to /var/cache/conftool/dbconfig/20220530-183635-ladsgroup.json [18:36:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:13] jouncebot: nowandnext [18:37:13] No deployments scheduled for the next 2 hour(s) and 22 minute(s) [18:37:13] In 2 hour(s) and 22 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220530T2100) [18:38:27] (03CR) 10Urbanecm: [C: 03+2] throttle: Add new throttle rule + remove expired ones [mediawiki-config] - 10https://gerrit.wikimedia.org/r/800711 (https://phabricator.wikimedia.org/T309395) (owner: 10Urbanecm) [18:39:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T60674)', diff saved to https://phabricator.wikimedia.org/P29079 and previous config saved to /var/cache/conftool/dbconfig/20220530-183918-ladsgroup.json [18:39:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [18:39:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [18:39:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:25] T60674: Drop 
page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [18:39:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T60674)', diff saved to https://phabricator.wikimedia.org/P29080 and previous config saved to /var/cache/conftool/dbconfig/20220530-183926-ladsgroup.json [18:39:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:30] (03Merged) 10jenkins-bot: throttle: Add new throttle rule + remove expired ones [mediawiki-config] - 10https://gerrit.wikimedia.org/r/800711 (https://phabricator.wikimedia.org/T309395) (owner: 10Urbanecm) [18:39:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:14] scap's taking a long time... [18:43:19] !log urbanecm@deploy1002 Synchronized wmf-config/throttle.php: 6bd5783cd86a74c36d475267974384482b2e534f: throttle: Add new throttle rule + remove expired ones (T309395) (duration: 03m 18s) [18:43:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:24] T309395: IP throttle lift request for a Czech editaton in Prague - https://phabricator.wikimedia.org/T309395 [18:43:27] finally [18:44:06] !log Clear IP signup throttle for 193.86.226.2 (T309395) [18:44:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:45:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:37] * urbanecm done [18:45:37] RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:47:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P29081 and previous config saved to 
/var/cache/conftool/dbconfig/20220530-184750-ladsgroup.json [18:47:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:49:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:49:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T60674)', diff saved to https://phabricator.wikimedia.org/P29082 and previous config saved to /var/cache/conftool/dbconfig/20220530-185140-ladsgroup.json [18:51:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [18:51:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [18:51:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:48] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [18:51:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T60674)', diff saved to https://phabricator.wikimedia.org/P29083 and previous config saved to /var/cache/conftool/dbconfig/20220530-185149-ladsgroup.json [18:51:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:53:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db1170:3317 (T60674)', diff saved to https://phabricator.wikimedia.org/P29084 and previous config saved to /var/cache/conftool/dbconfig/20220530-185635-ladsgroup.json [18:56:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T60674)', diff saved to https://phabricator.wikimedia.org/P29085 and previous config saved to /var/cache/conftool/dbconfig/20220530-185913-ladsgroup.json [18:59:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:21] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [18:59:55] (03CR) 10Jforrester: [C: 03+1] "Nice spot." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801405 (https://phabricator.wikimedia.org/T309544) (owner: 10Lucas Werkmeister (WMDE)) [19:00:17] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:01:01] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (15) node(s) change every puppet run: cloudservices1003, cloudservices1004, ms-fe1010, ms-fe1011, ms-fe1012, ms-fe2010, ms-fe2011, ms-fe2012, netbox1002, netbox2002, thanos-fe1002, thanos-fe1003, thanos-fe2001, thanos-fe2002, thanos-fe2003 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [19:02:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P29086 and previous config saved to /var/cache/conftool/dbconfig/20220530-190255-ladsgroup.json [19:03:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:39] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [19:09:44] (03CR) 10AGueyte: Deploy SimilarEditors to the beta cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799012 (https://phabricator.wikimedia.org/T306908) (owner: 10AGueyte) [19:11:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P29087 and previous config saved to /var/cache/conftool/dbconfig/20220530-191140-ladsgroup.json [19:11:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:55] (03PS3) 10AGueyte: Assign similareditors right to the checkuser group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799378 (https://phabricator.wikimedia.org/T307205) [19:12:09] (03CR) 10AGueyte: Assign similareditors right to the checkuser group (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799378 (https://phabricator.wikimedia.org/T307205) (owner: 10AGueyte) [19:14:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P29088 and previous config saved to /var/cache/conftool/dbconfig/20220530-191418-ladsgroup.json [19:14:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:26] (03PS1) 10Stang: zhwiktionary: Create namespace "Thesaurus" and "Citations" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801420 (https://phabricator.wikimedia.org/T309564) [19:18:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T60674)', diff saved to https://phabricator.wikimedia.org/P29089 and previous config saved to /var/cache/conftool/dbconfig/20220530-191800-ladsgroup.json [19:18:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance [19:18:04] !log 
ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance [19:18:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:08] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [19:18:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1119 (T60674)', diff saved to https://phabricator.wikimedia.org/P29090 and previous config saved to /var/cache/conftool/dbconfig/20220530-191808-ladsgroup.json [19:18:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:17] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [19:26:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P29091 and previous config saved to /var/cache/conftool/dbconfig/20220530-192645-ladsgroup.json [19:26:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:47] (PS1) Krinkle: MetaContactPages: Update reference to `ext.wikimediamessages.contactpage` [mediawiki-config] - https://gerrit.wikimedia.org/r/801423 [19:29:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P29092 and previous config saved to /var/cache/conftool/dbconfig/20220530-192923-ladsgroup.json [19:29:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119
(T60674)', diff saved to https://phabricator.wikimedia.org/P29093 and previous config saved to /var/cache/conftool/dbconfig/20220530-193405-ladsgroup.json [19:34:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:13] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [19:41:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T60674)', diff saved to https://phabricator.wikimedia.org/P29094 and previous config saved to /var/cache/conftool/dbconfig/20220530-194150-ladsgroup.json [19:41:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1136.eqiad.wmnet with reason: Maintenance [19:41:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1136.eqiad.wmnet with reason: Maintenance [19:41:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:57] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [19:41:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1136 (T60674)', diff saved to https://phabricator.wikimedia.org/P29095 and previous config saved to /var/cache/conftool/dbconfig/20220530-194158-ladsgroup.json [19:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T60674)', diff saved to https://phabricator.wikimedia.org/P29096 and previous config saved to /var/cache/conftool/dbconfig/20220530-194408-ladsgroup.json [19:44:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling 
after maintenance db1113:3316 (T60674)', diff saved to https://phabricator.wikimedia.org/P29097 and previous config saved to /var/cache/conftool/dbconfig/20220530-194428-ladsgroup.json [19:44:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [19:44:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [19:44:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3316 (T60674)', diff saved to https://phabricator.wikimedia.org/P29098 and previous config saved to /var/cache/conftool/dbconfig/20220530-194436-ladsgroup.json [19:44:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P29099 and previous config saved to /var/cache/conftool/dbconfig/20220530-194910-ladsgroup.json [19:49:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T60674)', diff saved to https://phabricator.wikimedia.org/P29100 and previous config saved to /var/cache/conftool/dbconfig/20220530-195233-ladsgroup.json [19:52:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:38] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [19:56:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T60674)', diff saved to https://phabricator.wikimedia.org/P29101 and 
previous config saved to /var/cache/conftool/dbconfig/20220530-195608-ladsgroup.json [19:56:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P29102 and previous config saved to /var/cache/conftool/dbconfig/20220530-195913-ladsgroup.json [19:59:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P29103 and previous config saved to /var/cache/conftool/dbconfig/20220530-200416-ladsgroup.json [20:04:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P29104 and previous config saved to /var/cache/conftool/dbconfig/20220530-200738-ladsgroup.json [20:07:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P29105 and previous config saved to /var/cache/conftool/dbconfig/20220530-201113-ladsgroup.json [20:11:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P29106 and previous config saved to /var/cache/conftool/dbconfig/20220530-201418-ladsgroup.json [20:14:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T60674)', diff saved to https://phabricator.wikimedia.org/P29107 and previous config saved 
to /var/cache/conftool/dbconfig/20220530-201921-ladsgroup.json [20:19:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance [20:19:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance [20:19:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [20:19:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:28] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [20:19:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [20:19:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T60674)', diff saved to https://phabricator.wikimedia.org/P29108 and previous config saved to /var/cache/conftool/dbconfig/20220530-201934-ladsgroup.json [20:19:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:35] !log Restarted ooze job pageview-druid-daily-coord [20:20:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P29109 and previous config saved to /var/cache/conftool/dbconfig/20220530-202243-ladsgroup.json [20:22:49] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:47] RECOVERY - cassandra-a CQL 10.192.48.182:9042 on restbase2027 is OK: TCP OK - 0.033 second response time on 10.192.48.182 port 9042 https://phabricator.wikimedia.org/T93886 [20:26:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P29110 and previous config saved to /var/cache/conftool/dbconfig/20220530-202618-ladsgroup.json [20:26:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T60674)', diff saved to https://phabricator.wikimedia.org/P29111 and previous config saved to /var/cache/conftool/dbconfig/20220530-202923-ladsgroup.json [20:29:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [20:29:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [20:29:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:30] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [20:29:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T60674)', diff saved to https://phabricator.wikimedia.org/P29112 and previous config saved to /var/cache/conftool/dbconfig/20220530-203536-ladsgroup.json [20:35:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:42] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [20:37:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling 
after maintenance db1098:3316 (T60674)', diff saved to https://phabricator.wikimedia.org/P29113 and previous config saved to /var/cache/conftool/dbconfig/20220530-203748-ladsgroup.json [20:37:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance [20:37:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance [20:37:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [20:38:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [20:38:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T60674)', diff saved to https://phabricator.wikimedia.org/P29114 and previous config saved to /var/cache/conftool/dbconfig/20220530-204123-ladsgroup.json [20:41:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1121.eqiad.wmnet with reason: Maintenance [20:41:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1121.eqiad.wmnet with reason: Maintenance [20:41:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [20:41:29] T60674: Drop page.page_restrictions column 
from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [20:41:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [20:41:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1121 (T60674)', diff saved to https://phabricator.wikimedia.org/P29115 and previous config saved to /var/cache/conftool/dbconfig/20220530-204137-ladsgroup.json [20:41:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [20:43:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [20:43:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [20:44:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [20:44:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T60674)', diff saved to https://phabricator.wikimedia.org/P29116 and previous config saved to /var/cache/conftool/dbconfig/20220530-204432-ladsgroup.json [20:44:34] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [20:49:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [20:49:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3316 (T60674)', diff saved to https://phabricator.wikimedia.org/P29117 and previous config saved to /var/cache/conftool/dbconfig/20220530-204924-ladsgroup.json [20:49:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:36] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [20:50:28] (PS6) Jbond: WIP: Early start on firmware cookbook [cookbooks] - https://gerrit.wikimedia.org/r/763215 [20:50:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P29118 and previous config saved to /var/cache/conftool/dbconfig/20220530-205041-ladsgroup.json [20:50:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:58] (CR) CI reject: [V: -1] WIP: Early start on firmware cookbook [cookbooks] - https://gerrit.wikimedia.org/r/763215 (owner: Jbond) [20:57:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T60674)', diff saved to https://phabricator.wikimedia.org/P29119 and previous config saved to /var/cache/conftool/dbconfig/20220530-205701-ladsgroup.json [20:57:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:09]
T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [21:00:04] Reedy, sbassett, Maryum, and manfredi: #bothumor I � Unicode. All rise for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220530T2100). [21:04:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T60674)', diff saved to https://phabricator.wikimedia.org/P29120 and previous config saved to /var/cache/conftool/dbconfig/20220530-210416-ladsgroup.json [21:04:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:24] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [21:05:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P29121 and previous config saved to /var/cache/conftool/dbconfig/20220530-210546-ladsgroup.json [21:05:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P29122 and previous config saved to /var/cache/conftool/dbconfig/20220530-211206-ladsgroup.json [21:12:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T60674)', diff saved to https://phabricator.wikimedia.org/P29123 and previous config saved to /var/cache/conftool/dbconfig/20220530-211621-ladsgroup.json [21:16:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:28] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [21:19:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to 
https://phabricator.wikimedia.org/P29124 and previous config saved to /var/cache/conftool/dbconfig/20220530-211922-ladsgroup.json [21:19:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T60674)', diff saved to https://phabricator.wikimedia.org/P29125 and previous config saved to /var/cache/conftool/dbconfig/20220530-212051-ladsgroup.json [21:20:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [21:20:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [21:20:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:03] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [21:27:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P29126 and previous config saved to /var/cache/conftool/dbconfig/20220530-212711-ladsgroup.json [21:27:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P29127 and previous config saved to /var/cache/conftool/dbconfig/20220530-213126-ladsgroup.json [21:31:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance 
db1127', diff saved to https://phabricator.wikimedia.org/P29128 and previous config saved to /var/cache/conftool/dbconfig/20220530-213427-ladsgroup.json [21:34:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance [21:34:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance [21:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T60674)', diff saved to https://phabricator.wikimedia.org/P29129 and previous config saved to /var/cache/conftool/dbconfig/20220530-213449-ladsgroup.json [21:34:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:57] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [21:42:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T60674)', diff saved to https://phabricator.wikimedia.org/P29130 and previous config saved to /var/cache/conftool/dbconfig/20220530-214216-ladsgroup.json [21:42:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance [21:42:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance [21:42:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [21:42:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:25] T60674: Drop page.page_restrictions column from Wikimedia wikis - 
https://phabricator.wikimedia.org/T60674 [21:42:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [21:42:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T60674)', diff saved to https://phabricator.wikimedia.org/P29131 and previous config saved to /var/cache/conftool/dbconfig/20220530-214230-ladsgroup.json [21:42:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T60674)', diff saved to https://phabricator.wikimedia.org/P29132 and previous config saved to /var/cache/conftool/dbconfig/20220530-214437-ladsgroup.json [21:44:41] (PS1) Jforrester: Follow-up I1dee51009: Add url() to list-style-image [skins/Vector] (wmf/1.39.0-wmf.13) - https://gerrit.wikimedia.org/r/801193 (https://phabricator.wikimedia.org/T309374) [21:44:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P29133 and previous config saved to /var/cache/conftool/dbconfig/20220530-214631-ladsgroup.json [21:46:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:09] PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:49:32] !log ladsgroup@cumin1001 dbctl commit (dc=all):
'Repooling after maintenance db1127 (T60674)', diff saved to https://phabricator.wikimedia.org/P29134 and previous config saved to /var/cache/conftool/dbconfig/20220530-214932-ladsgroup.json [21:49:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:39] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [21:51:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T60674)', diff saved to https://phabricator.wikimedia.org/P29135 and previous config saved to /var/cache/conftool/dbconfig/20220530-215116-ladsgroup.json [21:51:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P29136 and previous config saved to /var/cache/conftool/dbconfig/20220530-215942-ladsgroup.json [21:59:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T60674)', diff saved to https://phabricator.wikimedia.org/P29137 and previous config saved to /var/cache/conftool/dbconfig/20220530-220136-ladsgroup.json [22:01:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2110.codfw.wmnet with reason: Maintenance [22:01:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:43] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [22:01:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2110.codfw.wmnet with reason: Maintenance [22:01:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 12 hosts with reason: Maintenance [22:01:47] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 12 hosts with reason: Maintenance [22:01:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:06:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P29138 and previous config saved to /var/cache/conftool/dbconfig/20220530-220622-ladsgroup.json [22:06:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:21] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [22:14:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P29139 and previous config saved to /var/cache/conftool/dbconfig/20220530-221447-ladsgroup.json [22:14:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:20:01] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:21:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P29140 and previous config saved to /var/cache/conftool/dbconfig/20220530-222127-ladsgroup.json [22:21:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:27:15] SRE, Wikimedia-Site-requests, Performance-Team (Radar): Raise limit of $wgMaxArticleSize for Hebrew Wikisource -
https://phabricator.wikimedia.org/T275319 (Snaevar) >>! In T275319#7947012, @Krinkle wrote: > > **Question**: Is there specific and limited concrete use case here that happens to require... [22:29:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T60674)', diff saved to https://phabricator.wikimedia.org/P29141 and previous config saved to /var/cache/conftool/dbconfig/20220530-222952-ladsgroup.json [22:29:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [22:29:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [22:29:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:29:58] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [22:30:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T60674)', diff saved to https://phabricator.wikimedia.org/P29142 and previous config saved to /var/cache/conftool/dbconfig/20220530-223000-ladsgroup.json [22:30:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [22:31:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [22:31:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3314 (T60674)', diff saved to https://phabricator.wikimedia.org/P29143 and previous config saved to
/var/cache/conftool/dbconfig/20220530-223121-ladsgroup.json [22:31:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T60674)', diff saved to https://phabricator.wikimedia.org/P29144 and previous config saved to /var/cache/conftool/dbconfig/20220530-223207-ladsgroup.json [22:32:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T60674)', diff saved to https://phabricator.wikimedia.org/P29145 and previous config saved to /var/cache/conftool/dbconfig/20220530-223632-ladsgroup.json [22:36:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance [22:36:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance [22:36:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:39] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [22:36:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T60674)', diff saved to https://phabricator.wikimedia.org/P29146 and previous config saved to /var/cache/conftool/dbconfig/20220530-223640-ladsgroup.json [22:36:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P29147 and previous config saved to 
/var/cache/conftool/dbconfig/20220530-224712-ladsgroup.json [22:47:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:13] RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:53:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T60674)', diff saved to https://phabricator.wikimedia.org/P29148 and previous config saved to /var/cache/conftool/dbconfig/20220530-225311-ladsgroup.json [22:53:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:19] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [22:53:31] (PS2) Gergő Tisza: Tombstone the old session on SessionBackend::resetId() [core] (wmf/1.39.0-wmf.13) - https://gerrit.wikimedia.org/r/799388 (https://phabricator.wikimedia.org/T299193) [23:02:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P29149 and previous config saved to /var/cache/conftool/dbconfig/20220530-230217-ladsgroup.json [23:02:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:04:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T60674)', diff saved to https://phabricator.wikimedia.org/P29150 and previous config saved to /var/cache/conftool/dbconfig/20220530-230406-ladsgroup.json [23:04:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:04:14] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [23:08:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P29151 and previous config saved to /var/cache/conftool/dbconfig/20220530-230816-ladsgroup.json 
[23:08:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T60674)', diff saved to https://phabricator.wikimedia.org/P29152 and previous config saved to /var/cache/conftool/dbconfig/20220530-231723-ladsgroup.json [23:17:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance [23:17:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance [23:17:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:30] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [23:17:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T60674)', diff saved to https://phabricator.wikimedia.org/P29153 and previous config saved to /var/cache/conftool/dbconfig/20220530-231731-ladsgroup.json [23:17:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P29154 and previous config saved to /var/cache/conftool/dbconfig/20220530-231911-ladsgroup.json [23:19:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T60674)', diff saved to https://phabricator.wikimedia.org/P29155 and previous config saved to /var/cache/conftool/dbconfig/20220530-231938-ladsgroup.json [23:19:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:09] RECOVERY 
- SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:23:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P29156 and previous config saved to /var/cache/conftool/dbconfig/20220530-232321-ladsgroup.json [23:23:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:24:05] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [23:29:57] RECOVERY - SSH on druid1006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:32:39] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [23:34:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P29157 and previous config saved to /var/cache/conftool/dbconfig/20220530-233416-ladsgroup.json [23:34:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:34:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P29158 and previous config saved to /var/cache/conftool/dbconfig/20220530-233443-ladsgroup.json [23:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T60674)', diff saved to https://phabricator.wikimedia.org/P29159 and previous config saved to /var/cache/conftool/dbconfig/20220530-233826-ladsgroup.json [23:38:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1118.eqiad.wmnet with reason: Maintenance [23:38:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1118.eqiad.wmnet with reason: Maintenance [23:38:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:33] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [23:38:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1118 (T60674)', diff saved to https://phabricator.wikimedia.org/P29160 and previous config saved to /var/cache/conftool/dbconfig/20220530-233834-ladsgroup.json [23:38:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:48] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:49:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T60674)', diff saved to https://phabricator.wikimedia.org/P29161 and previous config saved to /var/cache/conftool/dbconfig/20220530-234921-ladsgroup.json [23:49:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1147.eqiad.wmnet with reason: Maintenance [23:49:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1147.eqiad.wmnet with reason: Maintenance [23:49:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:49:27] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [23:49:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1147 (T60674)', diff saved to https://phabricator.wikimedia.org/P29162 and previous config saved to /var/cache/conftool/dbconfig/20220530-234929-ladsgroup.json [23:49:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:49:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:49:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:49:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P29163 and previous config saved to /var/cache/conftool/dbconfig/20220530-234947-ladsgroup.json [23:49:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:54:01] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:54:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T60674)', diff saved to https://phabricator.wikimedia.org/P29164 and previous config saved to 
/var/cache/conftool/dbconfig/20220530-235432-ladsgroup.json [23:54:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:54:39] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674