[00:00:15] RECOVERY - Check systemd state on puppetmaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:00:51] RECOVERY - Check systemd state on puppetmaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:03:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P26958 and previous config saved to /var/cache/conftool/dbconfig/20220429-000327-ladsgroup.json [00:03:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:13:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T298565)', diff saved to https://phabricator.wikimedia.org/P26959 and previous config saved to /var/cache/conftool/dbconfig/20220429-001320-ladsgroup.json [00:13:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:13:27] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [00:13:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1138.eqiad.wmnet with reason: Maintenance [00:13:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1138.eqiad.wmnet with reason: Maintenance [00:13:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:13:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1138 (T298565)', diff saved to https://phabricator.wikimedia.org/P26960 and previous config saved to /var/cache/conftool/dbconfig/20220429-001333-ladsgroup.json [00:13:36] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:13:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:18:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T306560)', diff saved to https://phabricator.wikimedia.org/P26961 and previous config saved to /var/cache/conftool/dbconfig/20220429-001832-ladsgroup.json [00:18:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [00:18:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [00:18:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:18:39] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [00:18:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T306560)', diff saved to https://phabricator.wikimedia.org/P26962 and previous config saved to /var/cache/conftool/dbconfig/20220429-001840-ladsgroup.json [00:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:18:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:18:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:25:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T298565)', diff saved to https://phabricator.wikimedia.org/P26963 and previous config saved to /var/cache/conftool/dbconfig/20220429-002518-ladsgroup.json [00:25:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:25:26] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - 
https://phabricator.wikimedia.org/T298565 [00:35:08] SRE-OnFire (FY2021/2022-Q3): Incident: 2022-03-08 phpfpm worker saturation - https://phabricator.wikimedia.org/T307165 (lmata) [00:36:04] SRE-OnFire (FY2021/2022-Q3): Incident: 2022-03-10_MediaWiki_availability - https://phabricator.wikimedia.org/T307166 (lmata) [00:36:38] SRE-OnFire (FY2021/2022-Q3): Incident: 2022-03-27 api - https://phabricator.wikimedia.org/T307167 (lmata) [00:36:41] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:37:00] SRE-OnFire (FY2021/2022-Q3): Incident: Incidents/2022-03-27 wdqs outage - https://phabricator.wikimedia.org/T307168 (lmata) [00:37:34] SRE-OnFire (FY2021/2022-Q3): Incident: 2022-03-29 network - https://phabricator.wikimedia.org/T307169 (lmata) [00:37:46] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host aqs2006.mgmt.codfw.wmnet with reboot policy FORCED [00:37:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:38:07] SRE-OnFire (FY2021/2022-Q3): Incident: 2022-03-31 api errors - https://phabricator.wikimedia.org/T307170 (lmata) [00:38:40] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host aqs2006.mgmt.codfw.wmnet with reboot policy FORCED [00:38:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:40:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P26964 and previous config saved to /var/cache/conftool/dbconfig/20220429-004023-ladsgroup.json [00:40:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:41:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T306560)', diff saved to https://phabricator.wikimedia.org/P26965 and previous config saved to 
/var/cache/conftool/dbconfig/20220429-004157-ladsgroup.json [00:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:42:04] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [00:44:02] SRE-OnFire (FY2021/2022-Q3): Incident: 2022-03-10_MediaWiki_availability - https://phabricator.wikimedia.org/T307166 (lmata) [00:44:17] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host aqs2006.mgmt.codfw.wmnet with reboot policy FORCED [00:44:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:44:28] SRE-OnFire (FY2021/2022-Q3), Data-Persistence (Consultation), Platform Engineering, Performance-Team (Radar), Sustainability (Incident Followup): Incident: 2022-03-10 MediaWiki availability affected due to a database query processing slowdown affecting ... - https://phabricator.wikimedia.org/T303499 [00:45:21] SRE-OnFire (FY2021/2022-Q3): Incident: 2022-03-27 wdqs outage - https://phabricator.wikimedia.org/T307168 (lmata) [00:46:02] SRE-OnFire (FY2021/2022-Q3): Incident: 2022-02-22 eqiad-eqord saturation - https://phabricator.wikimedia.org/T307158 (lmata) [00:46:39] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aqs2005.mgmt.codfw.wmnet with reboot policy FORCED [00:46:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:47:01] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host aqs2007.mgmt.codfw.wmnet with reboot policy FORCED [00:47:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:47:28] SRE-OnFire (FY2021/2022-Q3): incidents occurring in Q3 have been scored with the scorecard - https://phabricator.wikimedia.org/T299977 (lmata) Open→In progress p:Triage→Medium [00:50:46] SRE-OnFire (FY2021/2022-Q3): Incident: 2022-02-06_wdqs_updater - https://phabricator.wikimedia.org/T307156 
(lmata) Open→Resolved a:lmata scorecard and document on wikitech, resolving [00:51:22] SRE-OnFire (FY2021/2022-Q3): Incident: 2022-02-10 Envoy overflow - https://phabricator.wikimedia.org/T307157 (lmata) [00:52:50] SRE-OnFire (FY2021/2022-Q3): Incident: 2022-02-10 Envoy overflow - https://phabricator.wikimedia.org/T307157 (lmata) Open→In progress missing metadata [00:53:11] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host aqs2006.mgmt.codfw.wmnet with reboot policy FORCED [00:53:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:54:01] PROBLEM - Check systemd state on an-worker1080 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:54:02] SRE-OnFire (FY2021/2022-Q3): Incident: 2022-02-22 eqiad-eqord saturation - https://phabricator.wikimedia.org/T307158 (lmata) scorecard done, missing metadata [00:54:17] PROBLEM - Hadoop NodeManager on an-worker1080 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [00:55:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P26966 and previous config saved to /var/cache/conftool/dbconfig/20220429-005528-ladsgroup.json [00:55:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:55:39] SRE-OnFire (FY2021/2022-Q3): Incident: 2022-02-22_vrts - https://phabricator.wikimedia.org/T307159 (lmata) Open→In progress scorecard complete, missing metadata. 
[00:56:19] RECOVERY - Check systemd state on an-worker1080 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:56:35] RECOVERY - Hadoop NodeManager on an-worker1080 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [00:56:37] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host aqs2006.mgmt.codfw.wmnet with reboot policy FORCED [00:56:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:56:55] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host aqs2008.mgmt.codfw.wmnet with reboot policy FORCED [00:56:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:57:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P26967 and previous config saved to /var/cache/conftool/dbconfig/20220429-005702-ladsgroup.json [00:57:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:58:55] SRE-OnFire (FY2021/2022-Q3): Incident: 2022-03-01_ulsfo_network - https://phabricator.wikimedia.org/T307161 (lmata) Open→In progress scorecard complete, missing metadata. 
[01:03:21] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:04:54] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aqs2007.mgmt.codfw.wmnet with reboot policy FORCED [01:04:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:06:22] SRE-OnFire (FY2021/2022-Q3): Incident: 2022-03-29 network - https://phabricator.wikimedia.org/T307169 (lmata) scorecard is incomplete [01:10:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T298565)', diff saved to https://phabricator.wikimedia.org/P26968 and previous config saved to /var/cache/conftool/dbconfig/20220429-011033-ladsgroup.json [01:10:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:10:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1147.eqiad.wmnet with reason: Maintenance [01:10:41] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [01:10:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1147.eqiad.wmnet with reason: Maintenance [01:10:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:10:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1147 (T298565)', diff saved to https://phabricator.wikimedia.org/P26969 and previous config saved to /var/cache/conftool/dbconfig/20220429-011046-ladsgroup.json [01:10:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:10:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:12:08] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P26970 and previous config saved to /var/cache/conftool/dbconfig/20220429-011207-ladsgroup.json [01:12:08] SRE-OnFire (FY2021/2022-Q3): incidents occurring in Q3 have been scored with the scorecard - https://phabricator.wikimedia.org/T299977 (lmata) [01:12:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:12:37] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [01:21:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T298565)', diff saved to https://phabricator.wikimedia.org/P26971 and previous config saved to /var/cache/conftool/dbconfig/20220429-012137-ladsgroup.json [01:21:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:21:44] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [01:27:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T306560)', diff saved to https://phabricator.wikimedia.org/P26972 and previous config saved to /var/cache/conftool/dbconfig/20220429-012713-ladsgroup.json [01:27:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [01:27:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [01:27:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:27:20] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [01:27:24] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:27:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [01:27:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [01:27:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:27:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:27:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:27:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [01:27:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [01:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:27:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:27:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance [01:27:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance [01:27:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [01:27:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:28:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:28:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [01:28:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:28:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:28:21] !log ladsgroup@cumin1001 START - Cookbook 
sre.hosts.downtime for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [01:28:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [01:28:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:28:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T306560)', diff saved to https://phabricator.wikimedia.org/P26973 and previous config saved to /var/cache/conftool/dbconfig/20220429-012827-ladsgroup.json [01:28:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:28:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:31:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T306560)', diff saved to https://phabricator.wikimedia.org/P26974 and previous config saved to /var/cache/conftool/dbconfig/20220429-013141-ladsgroup.json [01:31:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:35:43] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aqs2008.mgmt.codfw.wmnet with reboot policy FORCED [01:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:36:37] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host aqs2009.mgmt.codfw.wmnet with reboot policy FORCED [01:36:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:36:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P26975 and previous config saved to /var/cache/conftool/dbconfig/20220429-013642-ladsgroup.json [01:36:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:37:01] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host aqs2010.mgmt.codfw.wmnet with reboot policy FORCED [01:37:05] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:39:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:46:12] (PS20) Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) [01:46:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P26976 and previous config saved to /var/cache/conftool/dbconfig/20220429-014646-ladsgroup.json [01:46:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:48:05] (CR) jerkins-bot: [V: -1] Create REST api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe) [01:49:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:51:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P26977 and previous config saved to /var/cache/conftool/dbconfig/20220429-015147-ladsgroup.json [01:51:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:51:58] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.07 ms [01:52:11] (PS21) Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) [01:54:02] (CR) 
jerkins-bot: [V: -1] Create REST api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe) [01:56:18] RECOVERY - SSH on labweb1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:01:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P26978 and previous config saved to /var/cache/conftool/dbconfig/20220429-020151-ladsgroup.json [02:01:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:02:20] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 2 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (Papaul) @Cmjohnson what i am seeing in the partman recipe that the server is using ,line 10 is removing any existing LVM ` 10... 
[02:06:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T298565)', diff saved to https://phabricator.wikimedia.org/P26979 and previous config saved to /var/cache/conftool/dbconfig/20220429-020652-ladsgroup.json [02:06:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:06:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1147.eqiad.wmnet with reason: Maintenance [02:06:59] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [02:07:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1147.eqiad.wmnet with reason: Maintenance [02:07:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:07:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1147 (T298565)', diff saved to https://phabricator.wikimedia.org/P26980 and previous config saved to /var/cache/conftool/dbconfig/20220429-020705-ladsgroup.json [02:07:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:07:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:07:48] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aqs2009.mgmt.codfw.wmnet with reboot policy FORCED [02:07:50] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aqs2010.mgmt.codfw.wmnet with reboot policy FORCED [02:07:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:07:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:08:44] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host aqs2011.mgmt.codfw.wmnet with 
reboot policy FORCED [02:08:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:09:20] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host aqs2012.mgmt.codfw.wmnet with reboot policy FORCED [02:09:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:16:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T306560)', diff saved to https://phabricator.wikimedia.org/P26981 and previous config saved to /var/cache/conftool/dbconfig/20220429-021657-ladsgroup.json [02:16:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [02:17:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [02:17:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [02:17:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:17:04] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [02:17:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [02:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:17:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T306560)', diff saved to https://phabricator.wikimedia.org/P26982 and previous config saved to /var/cache/conftool/dbconfig/20220429-021710-ladsgroup.json [02:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:17:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:17:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:17:27] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:17:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T298565)', diff saved to https://phabricator.wikimedia.org/P26983 and previous config saved to /var/cache/conftool/dbconfig/20220429-021735-ladsgroup.json [02:17:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:17:42] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [02:19:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T306560)', diff saved to https://phabricator.wikimedia.org/P26984 and previous config saved to /var/cache/conftool/dbconfig/20220429-021924-ladsgroup.json [02:19:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:32:37] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aqs2011.mgmt.codfw.wmnet with reboot policy FORCED [02:32:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P26985 and previous config saved to /var/cache/conftool/dbconfig/20220429-023240-ladsgroup.json [02:32:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:32:44] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aqs2012.mgmt.codfw.wmnet with reboot policy FORCED [02:32:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:32:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:34:10] (PS22) Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf [puppet] - 
https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) [02:34:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P26986 and previous config saved to /var/cache/conftool/dbconfig/20220429-023429-ladsgroup.json [02:34:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:35:19] SRE, ops-codfw, Cassandra, DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (Papaul) [02:36:01] (CR) jerkins-bot: [V: -1] Create REST api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe) [02:37:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [02:44:01] (PS23) Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) [02:45:56] (CR) jerkins-bot: [V: -1] Create REST api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe) [02:47:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P26987 and previous config saved to /var/cache/conftool/dbconfig/20220429-024745-ladsgroup.json [02:47:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:49:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P26988 and previous 
config saved to /var/cache/conftool/dbconfig/20220429-024934-ladsgroup.json [02:49:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:51:32] (PS24) Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) [02:52:33] (PS25) Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) [02:53:07] (CR) jerkins-bot: [V: -1] Create REST api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe) [02:54:56] SRE-OnFire (FY2021/2022-Q3): incidents occurring in Q3 have been scored with the scorecard - https://phabricator.wikimedia.org/T299977 (lmata) [02:55:33] (PS26) Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) [02:57:24] (CR) jerkins-bot: [V: -1] Create REST api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe) [03:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [03:02:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T298565)', diff saved to https://phabricator.wikimedia.org/P26989 and previous config saved to /var/cache/conftool/dbconfig/20220429-030250-ladsgroup.json [03:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:02:57] !log 
ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1147.eqiad.wmnet with reason: Maintenance [03:02:58] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [03:02:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1147.eqiad.wmnet with reason: Maintenance [03:03:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:03:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1147 (T298565)', diff saved to https://phabricator.wikimedia.org/P26990 and previous config saved to /var/cache/conftool/dbconfig/20220429-030303-ladsgroup.json [03:03:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:03:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:04:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T306560)', diff saved to https://phabricator.wikimedia.org/P26991 and previous config saved to /var/cache/conftool/dbconfig/20220429-030439-ladsgroup.json [03:04:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [03:04:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [03:04:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:04:46] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [03:04:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T306560)', diff saved to https://phabricator.wikimedia.org/P26992 and previous config saved to 
/var/cache/conftool/dbconfig/20220429-030447-ladsgroup.json [03:04:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:04:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:05:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:09:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T306560)', diff saved to https://phabricator.wikimedia.org/P26993 and previous config saved to /var/cache/conftool/dbconfig/20220429-030900-ladsgroup.json [03:09:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:13:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T298565)', diff saved to https://phabricator.wikimedia.org/P26994 and previous config saved to /var/cache/conftool/dbconfig/20220429-031328-ladsgroup.json [03:13:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:13:35] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [03:14:55] (PS27) Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) [03:17:27] (CR) jerkins-bot: [V: -1] Create REST api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe) [03:22:45] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [03:24:05] !log ladsgroup@cumin1001 
dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P26995 and previous config saved to /var/cache/conftool/dbconfig/20220429-032405-ladsgroup.json [03:24:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:28:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P26996 and previous config saved to /var/cache/conftool/dbconfig/20220429-032833-ladsgroup.json [03:28:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:39:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P26997 and previous config saved to /var/cache/conftool/dbconfig/20220429-033910-ladsgroup.json [03:39:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:43:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P26998 and previous config saved to /var/cache/conftool/dbconfig/20220429-034338-ladsgroup.json [03:43:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:50:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [03:54:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T306560)', diff saved to https://phabricator.wikimedia.org/P26999 and previous config saved to /var/cache/conftool/dbconfig/20220429-035415-ladsgroup.json [03:54:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:54:22] T306560: Fix nullability of img_major_mime and oi_major_mime - 
https://phabricator.wikimedia.org/T306560 [03:58:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T298565)', diff saved to https://phabricator.wikimedia.org/P27000 and previous config saved to /var/cache/conftool/dbconfig/20220429-035843-ladsgroup.json [03:58:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:58:49] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [04:10:03] PROBLEM - Query Service HTTP Port on wdqs1006 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [05:06:49] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:08:47] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:17:27] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:25:01] SRE, DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (Marostegui) I am seeing the procurement tasks being processed already, does that mean we have established that this controller will work 100% for us then? 
[05:27:48] (PS1) Marostegui: pc2012: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/787580 [05:28:36] (CR) Marostegui: [C: +2] pc2012: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/787580 (owner: Marostegui) [06:30:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1132 T301879', diff saved to https://phabricator.wikimedia.org/P27001 and previous config saved to /var/cache/conftool/dbconfig/20220429-063019-marostegui.json [06:30:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:30:28] T301879: Test MariaDB 10.6 on Bullseye - https://phabricator.wikimedia.org/T301879 [06:31:19] (PS1) Marostegui: db1132: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/787583 [06:31:57] (CR) Marostegui: [C: +2] db1132: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/787583 (owner: Marostegui) [06:37:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:58:36] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 42.5 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220429T0700) [07:00:50] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [07:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [07:07:56] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 14 hosts with reason: Reimaging db2103 T303171 [07:08:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:08:03] T303171: Upgrade s1 to Bullseye - https://phabricator.wikimedia.org/T303171 [07:08:05] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 14 hosts with reason: Reimaging db2103 T303171 [07:08:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:09:01] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2103.codfw.wmnet with reason: Rebooting for T303171 [07:09:03] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2103.codfw.wmnet with reason: Rebooting for T303171 [07:09:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:09:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:41] (CR) Giuseppe Lavagetto: [C: +1] "LGTM, see a minor (optional) improvement to the template." [puppet] - https://gerrit.wikimedia.org/r/784798 (https://phabricator.wikimedia.org/T305581) (owner: RLazarus) [07:18:04] SRE, DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (MoritzMuehlenhoff) >>! 
In T297913#7890055, @Marostegui wrote: > I am seeing the procurement tasks being processed already, does that mean we have established that this controller will work 100% for us th... [07:22:21] (PS1) Muehlenhoff: Remove expiry date for kcv-wikimf [puppet] - https://gerrit.wikimedia.org/r/787686 [07:22:45] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [07:26:07] !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be2040.codfw.wmnet with OS bullseye [07:26:11] SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host ms-be2040.codfw.wmnet with OS bullseye [07:26:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:08] (CR) Giuseppe Lavagetto: [C: -1] "Add the needed relationship chain." [puppet] - https://gerrit.wikimedia.org/r/784761 (https://phabricator.wikimedia.org/T305581) (owner: RLazarus) [07:27:17] !log kormat@cumin1001 START - Cookbook sre.hosts.reimage for host db2103.codfw.wmnet with OS bullseye [07:27:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:11] (CR) Muehlenhoff: [C: +2] Remove expiry date for kcv-wikimf [puppet] - https://gerrit.wikimedia.org/r/787686 (owner: Muehlenhoff) [07:34:50] (CR) Giuseppe Lavagetto: "The code LGTM; however I'd prefer a different solution, where we use a single header for this." [puppet] - https://gerrit.wikimedia.org/r/784774 (https://phabricator.wikimedia.org/T305581) (owner: RLazarus) [07:37:29] (CR) Awight: "This change is ready for review." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/787689 (https://phabricator.wikimedia.org/T307110) (owner: Awight) [07:40:55] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2103.codfw.wmnet with reason: host reimage [07:41:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:47] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2103.codfw.wmnet with reason: host reimage [07:43:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:42] SRE, DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (Marostegui) Thanks @MoritzMuehlenhoff! [07:46:30] (CR) Awight: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/787690 (https://phabricator.wikimedia.org/T306867) (owner: Awight) [07:49:06] (CR) Thiemo Kreuz (WMDE): [C: +1] Enable the versioned mapdata API [mediawiki-config] - https://gerrit.wikimedia.org/r/787689 (https://phabricator.wikimedia.org/T307110) (owner: Awight) [07:49:28] (CR) Filippo Giunchedi: "The basic idea (if I understood correctly) looks good to me, thank you! 
I propose a quick chat next week to hash out ideas/thoughts in a h" [puppet] - https://gerrit.wikimedia.org/r/787067 (owner: Jbond) [07:49:38] (CR) Thiemo Kreuz (WMDE): [C: +1] Enable CodeMirror colorblind-friendly palette [mediawiki-config] - https://gerrit.wikimedia.org/r/787690 (https://phabricator.wikimedia.org/T306867) (owner: Awight) [07:50:01] !log mvernon@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2040.codfw.wmnet with OS bullseye [07:50:06] SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host ms-be2040.codfw.wmnet with OS bullseye executed with errors: - ms-be2040 (**FAIL**)... [07:50:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [07:51:01] !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be2040.codfw.wmnet with OS bullseye [07:51:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:06] SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host ms-be2040.codfw.wmnet with OS bullseye [07:53:44] (CR) Filippo Giunchedi: "LGTM, modulo possibly unrelated change" [puppet] - https://gerrit.wikimedia.org/r/787109 (https://phabricator.wikimedia.org/T301017) (owner: Cwhite) [07:55:08] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
db2103.codfw.wmnet with OS bullseye [07:55:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:31] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1163.eqiad.wmnet with reason: Rebooting for T303171 [08:00:33] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1163.eqiad.wmnet with reason: Rebooting for T303171 [08:00:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:38] T303171: Upgrade s1 to Bullseye - https://phabricator.wikimedia.org/T303171 [08:00:38] !log kormat@cumin1001 dbctl commit (dc=all): 'db1163 depooling: Rebooting for T303171', diff saved to https://phabricator.wikimedia.org/P27003 and previous config saved to /var/cache/conftool/dbconfig/20220429-080038-kormat.json [08:00:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:23] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2002.codfw.wmnet [08:01:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:58] (CR) Volans: "replies inline" [software/spicerack] - https://gerrit.wikimedia.org/r/775904 (owner: Volans) [08:02:12] !log kormat@cumin1001 START - Cookbook sre.hosts.reimage for host db1163.eqiad.wmnet with OS bullseye [08:02:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:14] !log jelto@cumin1001 conftool action : set/pooled=no; selector: name=mw1323.eqiad.wmnet [08:03:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2002.codfw.wmnet [08:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:28] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 
on db1163.eqiad.wmnet with reason: host reimage [08:13:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:55] !log mvernon@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2040.codfw.wmnet with OS bullseye [08:14:59] SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host ms-be2040.codfw.wmnet with OS bullseye executed with errors: - ms-be2040 (**FAIL**)... [08:15:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:25] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1163.eqiad.wmnet with reason: host reimage [08:16:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:41] (CR) WMDE-Fisch: [C: +1] Enable the versioned mapdata API [mediawiki-config] - https://gerrit.wikimedia.org/r/787689 (https://phabricator.wikimedia.org/T307110) (owner: Awight) [08:21:43] !log scap pull on mw1323 [08:21:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:05] !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be2040.codfw.wmnet with OS bullseye [08:27:10] SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host ms-be2040.codfw.wmnet with OS bullseye [08:27:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:53] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1163.eqiad.wmnet with OS bullseye [08:27:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:50] !log kormat@cumin1001 dbctl commit (dc=all): 
'db1163 (re)pooling @ 25%: Reboot T303171', diff saved to https://phabricator.wikimedia.org/P27004 and previous config saved to /var/cache/conftool/dbconfig/20220429-082850-kormat.json [08:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:57] T303171: Upgrade s1 to Bullseye - https://phabricator.wikimedia.org/T303171 [08:30:41] (CR) WMDE-Fisch: [C: +1] Enable CodeMirror colorblind-friendly palette [mediawiki-config] - https://gerrit.wikimedia.org/r/787690 (https://phabricator.wikimedia.org/T306867) (owner: Awight) [08:33:33] !log jelto@cumin1001 conftool action : set/pooled=yes; selector: name=mw1323.eqiad.wmnet [08:33:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:58] SRE-OnFire (FY2021/2022-Q3): Incident: 2022-02-01_ulsfo_network - https://phabricator.wikimedia.org/T307154 (jcrespo) Done, the incident actually happened on the 2022-01-31 UTC, which confused me. [08:34:06] SRE-OnFire (FY2021/2022-Q3): Incident: 2022-02-01_ulsfo_network - https://phabricator.wikimedia.org/T307154 (jcrespo) a:jcrespo [08:34:47] SRE-OnFire (FY2021/2022-Q3): incidents occurring in Q3 have been scored with the scorecard - https://phabricator.wikimedia.org/T299977 (jcrespo) [08:35:11] SRE-OnFire (FY2021/2022-Q3): Incident: 2022-02-01_ulsfo_network - https://phabricator.wikimedia.org/T307154 (jcrespo) Open→Resolved [08:36:00] SRE-OnFire (FY2021/2022-Q3): incidents occurring in Q3 have been scored with the scorecard - https://phabricator.wikimedia.org/T299977 (jcrespo) [08:37:32] SRE-OnFire (FY2021/2022-Q3): incidents occurring in Q3 have been scored with the scorecard - https://phabricator.wikimedia.org/T299977 (jcrespo) [08:43:54] !log kormat@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 50%: Reboot T303171', diff saved to https://phabricator.wikimedia.org/P27005 and previous config saved to /var/cache/conftool/dbconfig/20220429-084354-kormat.json [08:44:01] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:03] T303171: Upgrade s1 to Bullseye - https://phabricator.wikimedia.org/T303171 [08:46:39] !log restarting blazegraph on wdqs1006 (deadlocked for 18hours, T242453) [08:46:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:45] T242453: Detect and alert and/or remediate Blazegraph deadlocks - https://phabricator.wikimedia.org/T242453 [08:47:04] RECOVERY - WDQS SPARQL on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.117 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [08:47:16] RECOVERY - Query Service HTTP Port on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.036 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [08:58:46] !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2040.codfw.wmnet with reason: host reimage [08:58:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:58] !log kormat@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 75%: Reboot T303171', diff saved to https://phabricator.wikimedia.org/P27006 and previous config saved to /var/cache/conftool/dbconfig/20220429-085858-kormat.json [08:59:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:05] T303171: Upgrade s1 to Bullseye - https://phabricator.wikimedia.org/T303171 [09:02:07] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2040.codfw.wmnet with reason: host reimage [09:02:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:03] SRE-OnFire (FY2021/2022-Q3): incidents occurring in Q3 have been scored with the scorecard - https://phabricator.wikimedia.org/T299977 (jcrespo) FYI: this template needs update for the new scoring: https://docs.google.com/document/d/1uZgzqURvPGdTw-GaG7gcIcStSR1bXBfsDszQ2M3tuWo/edit 
[09:04:56] (Storage /var over 50%) firing: Alert for device cloudsw1-e4-eqiad.mgmt.eqiad.wmnet - Storage /var over 50% - https://alerts.wikimedia.org/?q=alertname%3DStorage+%2Fvar+over+50%25 [09:09:34] SRE-swift-storage, Maps, Product-Infrastructure-Team-Backlog: Followups for Tegola and Swift interactions - https://phabricator.wikimedia.org/T307184 (fgiunchedi) [09:13:16] (CR) Jbond: "thanks both ill mark this as WIP until we have discussed in the meeting" [puppet] - https://gerrit.wikimedia.org/r/787067 (owner: Jbond) [09:14:02] !log kormat@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 100%: Reboot T303171', diff saved to https://phabricator.wikimedia.org/P27007 and previous config saved to /var/cache/conftool/dbconfig/20220429-091401-kormat.json [09:14:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:09] T303171: Upgrade s1 to Bullseye - https://phabricator.wikimedia.org/T303171 [09:14:56] (Storage /var over 50%) firing: (2) Alert for device cloudsw1-e4-eqiad.mgmt.eqiad.wmnet - Storage /var over 50% - https://alerts.wikimedia.org/?q=alertname%3DStorage+%2Fvar+over+50%25 [09:20:12] !log dcaro@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet2006-dev.codfw.wmnet with OS bullseye [09:20:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:20] SRE, ops-codfw, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install 7 wmcs hosts - https://phabricator.wikimedia.org/T304881 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dcaro@cumin1001 for host cloudnet2006-dev.codfw.wmne... 
[09:20:28] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:24:48] (CR) Jbond: service: add new module to expose service::catalog (2 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/775904 (owner: Volans) [09:25:35] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-cluster [09:25:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:53] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2040.codfw.wmnet with OS bullseye [09:27:56] SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host ms-be2040.codfw.wmnet with OS bullseye completed: - ms-be2040 (**PASS**) - Removed... [09:27:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:03] PROBLEM - very high load average likely xfs on ms-be2040 is CRITICAL: CRITICAL - load average: 200.20, 104.28, 43.74 https://wikitech.wikimedia.org/wiki/Swift [09:33:07] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti-test2002.codfw.wmnet to ganeti-test01.svc.codfw.wmnet [09:33:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:56] SRE-swift-storage, Maps, Product-Infrastructure-Team-Backlog, User-fgiunchedi: Followups for Tegola and Swift interactions - https://phabricator.wikimedia.org/T307184 (fgiunchedi) [09:35:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti-test2002.codfw.wmnet to ganeti-test01.svc.codfw.wmnet [09:35:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:06] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1164.eqiad.wmnet with 
reason: Rebooting for T303171 [09:36:08] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1164.eqiad.wmnet with reason: Rebooting for T303171 [09:36:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:13] T303171: Upgrade s1 to Bullseye - https://phabricator.wikimedia.org/T303171 [09:36:14] !log kormat@cumin1001 dbctl commit (dc=all): 'db1164 depooling: Rebooting for T303171', diff saved to https://phabricator.wikimedia.org/P27008 and previous config saved to /var/cache/conftool/dbconfig/20220429-093613-kormat.json [09:36:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:10] !log dcaro@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudnet2006-dev.codfw.wmnet with reason: host reimage [09:39:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:16] SRE, Beta-Cluster-Infrastructure: deployment-cache-upload05: Several millions of logstash error entries - https://phabricator.wikimedia.org/T243129 (fgiunchedi) Open→Declined Can't find any occurrence now, declining [09:39:56] (Storage /var over 50%) resolved: (2) Device cloudsw1-e4-eqiad.mgmt.eqiad.wmnet recovered from Storage /var over 50% - https://alerts.wikimedia.org/?q=alertname%3DStorage+%2Fvar+over+50%25 [09:41:36] (CR) Kormat: [C: +1] auto_schema: Wrap starting replication with finally [software] - https://gerrit.wikimedia.org/r/775895 (owner: Ladsgroup) [09:42:12] (PS4) Kormat: mariadb: Use ROW binlog_format for db_inventory. 
[puppet] - https://gerrit.wikimedia.org/r/775330 (https://phabricator.wikimedia.org/T301315) [09:42:12] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudnet2006-dev.codfw.wmnet with reason: host reimage [09:42:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:44] !log kormat@cumin1001 START - Cookbook sre.hosts.reimage for host db1164.eqiad.wmnet with OS bullseye [09:42:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:06] SRE, audits-data-retention: Implement Data Retention Guidelines - https://phabricator.wikimedia.org/T83531 (fgiunchedi) [09:43:37] (CR) Marostegui: [C: +1] mariadb: Use ROW binlog_format for db_inventory. [puppet] - https://gerrit.wikimedia.org/r/775330 (https://phabricator.wikimedia.org/T301315) (owner: Kormat) [09:44:57] (CR) Kormat: [C: +2] mariadb: Use ROW binlog_format for db_inventory. [puppet] - https://gerrit.wikimedia.org/r/775330 (https://phabricator.wikimedia.org/T301315) (owner: Kormat) [09:45:38] (CR) Jbond: [C: -1] "see comment for -1, otherwise lgtm" [puppet] - https://gerrit.wikimedia.org/r/787527 (https://phabricator.wikimedia.org/T307117) (owner: David Caro) [09:47:44] (PS6) Kormat: mariadb: Reference the actual VRTS passwords in the m2 grants file. [puppet] - https://gerrit.wikimedia.org/r/764744 (https://phabricator.wikimedia.org/T303272) [09:47:48] (CR) Awight: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/787699 (https://phabricator.wikimedia.org/T296759) (owner: Awight) [09:47:59] (CR) Awight: "This change is ready for review." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/787698 (owner: Awight) [09:48:52] SRE: mwdeploy does not have the same user ID on all Apaches - https://phabricator.wikimedia.org/T79786 (fgiunchedi) Open→Invalid I'm boldly resolving this old task since AFAICS the infra and deploy methods have changed enough that this hasn't surfaced as a problem anymore. [09:51:28] (CR) Kormat: [V: +1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35004/console" [puppet] - https://gerrit.wikimedia.org/r/764744 (https://phabricator.wikimedia.org/T303272) (owner: Kormat) [09:52:41] (CR) Kormat: [V: +1 C: +2] mariadb: Reference the actual VRTS passwords in the m2 grants file. [puppet] - https://gerrit.wikimedia.org/r/764744 (https://phabricator.wikimedia.org/T303272) (owner: Kormat) [09:54:11] SRE: Make inventory of (private) data backups on all systems - https://phabricator.wikimedia.org/T83522 (fgiunchedi) Open→Invalid Resolving this old task since likely obsolete at this point [09:54:13] SRE, audits-data-retention: Implement Data Retention Guidelines - https://phabricator.wikimedia.org/T83531 (fgiunchedi) [09:56:50] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudnet2006-dev.codfw.wmnet with OS bullseye [09:56:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:00] SRE, ops-codfw, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install 7 wmcs hosts - https://phabricator.wikimedia.org/T304881 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dcaro@cumin1001 for host cloudnet2006-dev.codfw.wmnet wi... 
[09:57:20] !log drain ganeti-test2003 T306499 [09:57:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:26] T306499: Upgrade ganeti-test to Bullseye - https://phabricator.wikimedia.org/T306499 [09:57:32] !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1040.eqiad.wmnet with OS bullseye [09:57:36] SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host ms-be1040.eqiad.wmnet with OS bullseye [09:57:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:50] SRE, Patch-For-Review: Provide cross-dc redundancy (active-active or active-passive) to all important misc services - https://phabricator.wikimedia.org/T156937 (fgiunchedi) [09:57:52] SRE: Setup basic infrastructure services in codfw - https://phabricator.wikimedia.org/T84350 (fgiunchedi) Open→Resolved a:fgiunchedi Resolving as completed since codfw has been up and running for years now [09:58:43] ^ \o/ [09:59:09] hehehe, the joys of SRE clinic duty [10:03:37] SRE: Make ops-l a list for humans again (no cheating) - https://phabricator.wikimedia.org/T117508 (fgiunchedi) Open→Resolved a:fgiunchedi I think nowadays `ops@` is in a pretty good place WRT automated emails, boldly resolving! 
[10:05:08] PROBLEM - Check systemd state on ms-be1064 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:06:13] SRE: Weak digest algorithm (SHA1) used to sign InRelease on apt.wikimedia.org - https://phabricator.wikimedia.org/T132325 (fgiunchedi) Open→Declined I don't think this is relevant anymore (and I personally haven't seen the warning either), therefore I'm declining the task [10:06:57] !log kormat@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1164.eqiad.wmnet with OS bullseye [10:07:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:45] SRE, Technical-Debt: Reduce etcd technical debt - https://phabricator.wikimedia.org/T135122 (fgiunchedi) [10:07:51] SRE: Install a second etcd cluster in codfw - https://phabricator.wikimedia.org/T135125 (fgiunchedi) Open→Resolved a:fgiunchedi We do have conf2* up and running nowadays, resolving [10:08:49] !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1040.eqiad.wmnet with reason: host reimage [10:08:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:42] (PS2) Jbond: puppetdb: add query_facts function [puppet] - https://gerrit.wikimedia.org/r/787547 [10:11:48] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1040.eqiad.wmnet with reason: host reimage [10:11:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:56] (PS3) Jbond: C:ssh::publish_fingerprints: update to use new_query facts function [puppet] - https://gerrit.wikimedia.org/r/787548 [10:13:07] SRE, Phabricator: Phabricator leaving old files in /tmp - https://phabricator.wikimedia.org/T150396 (fgiunchedi) Open→Invalid Doesn't seem to be an issue anymore, must have got fixed at some point in Phab's release 
cycle [10:14:16] (CR) Thiemo Kreuz (WMDE): [C: +1] Clean up unnecessary two-level setting [mediawiki-config] - https://gerrit.wikimedia.org/r/787698 (owner: Awight) [10:15:55] SRE: Internal PKI for secure communication - Barcelona Ops offsite 2016 - https://phabricator.wikimedia.org/T150822 (fgiunchedi) Open→Resolved a:fgiunchedi We do have a PKI now (e.g. T194031) so I'm going to resolve this task. cc @jbond in case there's material (hah!) here that could be useful as... [10:19:24] (CR) Alexandros Kosiaris: [C: -2] "I 've been meaning to remove this module finally now that kubernetes hosts (the only users I know of) are no longer using the docker devic" [puppet] - https://gerrit.wikimedia.org/r/787527 (https://phabricator.wikimedia.org/T307117) (owner: David Caro) [10:19:37] (PS1) Muehlenhoff: Update d-i setting to not prompt for firmware [puppet] - https://gerrit.wikimedia.org/r/787704 [10:20:09] (PS3) Jbond: puppetdb: add query_facts function [puppet] - https://gerrit.wikimedia.org/r/787547 [10:20:42] (PS4) Jbond: C:ssh::publish_fingerprints: update to use new_query facts function [puppet] - https://gerrit.wikimedia.org/r/787548 [10:27:19] ops-eqiad, DBA: db1164 fails to POST/boot/etc - https://phabricator.wikimedia.org/T307198 (Kormat) [10:30:50] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1040.eqiad.wmnet with OS bullseye [10:30:54] SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host ms-be1040.eqiad.wmnet with OS bullseye completed: - ms-be1040 (**PASS**) - Downtim... 
[10:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:17] SRE, ops-eqiad, DBA: db1164 fails to POST/boot/etc - https://phabricator.wikimedia.org/T307198 (Kormat) On the web console, i can see it get as far as this before resetting: {F35073130} [10:37:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:38:43] (PS1) Alexandros Kosiaris: Remove the puppet lvm module [puppet] - https://gerrit.wikimedia.org/r/787708 (https://phabricator.wikimedia.org/T67270) [10:40:08] (CR) Alexandros Kosiaris: "sudo cumin 'R:class=lvm'" [puppet] - https://gerrit.wikimedia.org/r/787708 (https://phabricator.wikimedia.org/T67270) (owner: Alexandros Kosiaris) [10:40:45] (CR) Alexandros Kosiaris: [C: -2] "Started the removal in https://gerrit.wikimedia.org/r/c/operations/puppet/+/787708" [puppet] - https://gerrit.wikimedia.org/r/787527 (https://phabricator.wikimedia.org/T307117) (owner: David Caro) [10:42:23] (CR) Alexandros Kosiaris: "Fleet wide PCC running as id: 35008" [puppet] - https://gerrit.wikimedia.org/r/787708 (https://phabricator.wikimedia.org/T67270) (owner: Alexandros Kosiaris) [10:43:30] PROBLEM - Check systemd state on ms-be2040 is CRITICAL: CRITICAL - degraded: The following units failed: swift-container.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:43:59] (Abandoned) Jgiannelos: Reduce production image size [software/tegola] (wmf/v0.14.x) - https://gerrit.wikimedia.org/r/702661 (owner: Jgiannelos) [10:44:17] (PS4) Jgiannelos: Deprecate unused maps event stream [mediawiki-config] - https://gerrit.wikimedia.org/r/747196 (https://phabricator.wikimedia.org/T293366) [10:44:30] PROBLEM - very high load average likely 
xfs on ms-be1040 is CRITICAL: CRITICAL - load average: 285.39, 175.65, 84.65 https://wikitech.wikimedia.org/wiki/Swift [10:45:15] (CR) Alexandros Kosiaris: [C: -1] "I see profile::swift::storage::labs uses this too. I 'll reach out to WMCS to make sure no VM uses this (nothing in production seems to us" [puppet] - https://gerrit.wikimedia.org/r/787708 (https://phabricator.wikimedia.org/T67270) (owner: Alexandros Kosiaris) [10:45:22] (CR) Jgiannelos: [C: +1] Deprecate unused maps event stream [mediawiki-config] - https://gerrit.wikimedia.org/r/747196 (https://phabricator.wikimedia.org/T293366) (owner: Jgiannelos) [10:47:27] (Abandoned) Jgiannelos: Add changeprop rules for DelayeEchoNotificationJob [deployment-charts] - https://gerrit.wikimedia.org/r/636416 (owner: Jgiannelos) [10:49:40] SRE, WMF-General-or-Unknown, WMF-Legal, Documentation, and 2 others: Default license for operations/puppet - https://phabricator.wikimedia.org/T67270 (akosiaris) https://gerrit.wikimedia.org/r/787708 removes the puppet lvm module which is GPL-2 and incompatible with apache 2.0. So that removes an... 
[10:50:50] PROBLEM - Host ms-be2040 is DOWN: PING CRITICAL - Packet loss = 100% [10:52:31] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0) [10:52:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:17] RECOVERY - very high load average likely xfs on ms-be2040 is OK: OK - load average: 13.89, 3.55, 1.20 https://wikitech.wikimedia.org/wiki/Swift [10:53:19] RECOVERY - Host ms-be2040 is UP: PING OK - Packet loss = 0%, RTA = 31.55 ms [10:55:48] (PS1) Ayounsi: Remove QFX5120 hack [software/netbox-extras] - https://gerrit.wikimedia.org/r/787709 [10:56:45] (PS4) Jbond: puppetdb: add query_facts function [puppet] - https://gerrit.wikimedia.org/r/787547 [10:58:04] (PS5) Jbond: C:ssh::publish_fingerprints: update to use new_query facts function [puppet] - https://gerrit.wikimedia.org/r/787548 [11:00:02] (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35009/console" [puppet] - https://gerrit.wikimedia.org/r/787548 (owner: Jbond) [11:01:25] RECOVERY - very high load average likely xfs on ms-be1040 is OK: OK - load average: 13.18, 4.15, 1.47 https://wikitech.wikimedia.org/wiki/Swift [11:01:27] PROBLEM - Check systemd state on ms-be1040 is CRITICAL: CRITICAL - degraded: The following units failed: swift-container-reconciler.service,swift-container-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [11:02:17] RECOVERY - Check systemd state on ms-be1064 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:04:24] (CR) Cathal 
Mooney: [C: +1] "LGTM!" [software/netbox-extras] - https://gerrit.wikimedia.org/r/787709 (owner: Ayounsi) [11:05:21] (CR) Thiemo Kreuz (WMDE): [C: +1] Enable new template dialog sidebar everywhere except enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/787699 (https://phabricator.wikimedia.org/T296759) (owner: Awight) [11:08:43] PROBLEM - Check systemd state on ms-be2040 is CRITICAL: CRITICAL - degraded: The following units failed: swift-container-reconciler.service,swift-container-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:13:53] SRE, SRE-swift-storage, Patch-For-Review: 'swift' user/group IDs should be consistent across the fleet - https://phabricator.wikimedia.org/T123918 (MatthewVernon) [11:14:22] SRE, SRE-swift-storage: reimaging swift backends should set swift UID/GID to match filesystems - https://phabricator.wikimedia.org/T300057 (MatthewVernon) Resolved→Open This doesn't work, because while busybox seems to have `stat` the installer can't find it?!? ` Apr 29 10:05:33 log-output: + mkt... 
[11:15:26] (CR) Ayounsi: [C: +2] Remove QFX5120 hack [software/netbox-extras] - https://gerrit.wikimedia.org/r/787709 (owner: Ayounsi) [11:16:04] (Merged) jenkins-bot: Remove QFX5120 hack [software/netbox-extras] - https://gerrit.wikimedia.org/r/787709 (owner: Ayounsi) [11:19:14] (CR) Jbond: [V: +1 C: +2] C:ssh::publish_fingerprints: update to use new_query facts function [puppet] - https://gerrit.wikimedia.org/r/787548 (owner: Jbond) [11:19:19] (CR) Jbond: [C: +2] puppetdb: add query_facts function [puppet] - https://gerrit.wikimedia.org/r/787547 (owner: Jbond) [11:22:45] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:24:19] SRE-OnFire (FY2021/2022-Q2): ONFIRE Q2 Undisclosed incidents scoring (NDA) - https://phabricator.wikimedia.org/T307202 (jcrespo) [11:24:46] SRE-OnFire (FY2021/2022-Q2): ONFIRE Q2 Undisclosed incidents scoring (NDA) - https://phabricator.wikimedia.org/T307202 (jcrespo) [11:28:51] (PS1) Jbond: puppetdb::query_facts: add return type hint [puppet] - https://gerrit.wikimedia.org/r/787713 [11:29:18] SRE-OnFire (FY2021/2022-Q2): ONFIRE Q2 Undisclosed incidents scoring (NDA) - https://phabricator.wikimedia.org/T307202 (jcrespo) [11:30:17] PROBLEM - Check systemd state on ms-be1063 is CRITICAL: CRITICAL - degraded: The following units failed: session-325816.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:30:42] (CR) Jbond: [C: +2] puppetdb::query_facts: add return type hint [puppet] - https://gerrit.wikimedia.org/r/787713 (owner: Jbond) [11:33:11] PROBLEM - Disk space on ms-be1040 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdc1 is not accessible: Input/output error https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be1040&var-datasource=eqiad+prometheus/ops [11:34:18] SRE-OnFire (FY2021/2022-Q2): Incident: 2021-10-14_Heavy_outbound_traffic - https://phabricator.wikimedia.org/T307149 (jcrespo) [11:36:25] (PS1) Jbond: Revert "C:ssh::publish_fingerprints: update to use new_query fac..." [puppet] - https://gerrit.wikimedia.org/r/787740 [11:36:28] (PS1) Jbond: Revert "puppetdb: add query_facts function" [puppet] - https://gerrit.wikimedia.org/r/787741 [11:36:43] (CR) Muehlenhoff: [C: +2] Update d-i setting to not prompt for firmware [puppet] - https://gerrit.wikimedia.org/r/787704 (owner: Muehlenhoff) [11:36:57] (PS2) Jbond: Revert "C:ssh::publish_fingerprints: update to use new_query fac..." [puppet] - https://gerrit.wikimedia.org/r/787740 [11:37:04] (CR) jerkins-bot: [V: -1] Revert "puppetdb: add query_facts function" [puppet] - https://gerrit.wikimedia.org/r/787741 (owner: Jbond) [11:37:11] (Abandoned) Jbond: Revert "puppetdb: add query_facts function" [puppet] - https://gerrit.wikimedia.org/r/787741 (owner: Jbond) [11:37:40] (CR) David Caro: Remove the puppet lvm module (1 comment) [puppet] - https://gerrit.wikimedia.org/r/787708 (https://phabricator.wikimedia.org/T67270) (owner: Alexandros Kosiaris) [11:37:45] (Storage /var over 50%) firing: Alert for device asw1-b12-drmrs.mgmt.drmrs.wmnet - Storage /var over 50% - https://alerts.wikimedia.org/?q=alertname%3DStorage+%2Fvar+over+50%25 [11:38:07] (CR) David Caro: lvm::volume: add createonly flag and use in cinder backups (1 comment) [puppet] - https://gerrit.wikimedia.org/r/787527 (https://phabricator.wikimedia.org/T307117) (owner: David Caro) [11:39:53] (CR) Jbond: [C: +2] Revert "C:ssh::publish_fingerprints: update to use new_query fac..." 
[puppet] - https://gerrit.wikimedia.org/r/787740 (owner: Jbond) [11:40:54] (CR) Ladsgroup: [C: +2] auto_schema: Wrap starting replication with finally [software] - https://gerrit.wikimedia.org/r/775895 (owner: Ladsgroup) [11:41:26] (Merged) jenkins-bot: auto_schema: Wrap starting replication with finally [software] - https://gerrit.wikimedia.org/r/775895 (owner: Ladsgroup) [11:42:45] (Storage /var over 50%) firing: (2) Device asw1-b12-drmrs.mgmt.drmrs.wmnet recovered from Storage /var over 50% - https://alerts.wikimedia.org/?q=alertname%3DStorage+%2Fvar+over+50%25 [11:43:50] (CR) David Caro: lvm::volume: add createonly flag and use in cinder backups (1 comment) [puppet] - https://gerrit.wikimedia.org/r/787527 (https://phabricator.wikimedia.org/T307117) (owner: David Caro) [11:47:45] (Storage /var over 50%) resolved: Device asw1-b13-drmrs.mgmt.drmrs.wmnet recovered from Storage /var over 50% - https://alerts.wikimedia.org/?q=alertname%3DStorage+%2Fvar+over+50%25 [11:49:06] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host gitlab1003.mgmt.eqiad.wmnet with reboot policy FORCED [11:49:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:53] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host gitlab1004.mgmt.eqiad.wmnet with reboot policy FORCED [11:49:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:44] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host gitlab-runner1002.mgmt.eqiad.wmnet with reboot policy FORCED [11:50:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [11:51:19] !log 
dcaro@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet2005-dev.codfw.wmnet with OS bullseye [11:51:24] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host gitlab-runner1003.mgmt.eqiad.wmnet with reboot policy FORCED [11:51:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:29] SRE, ops-codfw, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install 7 wmcs hosts - https://phabricator.wikimedia.org/T304881 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dcaro@cumin1001 for host cloudnet2005-dev.codfw.wmne... [11:56:49] (PS1) Jbond: puppetdb::query_facts: try to optimize [puppet] - https://gerrit.wikimedia.org/r/787716 [11:57:17] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host gitlab1003.mgmt.eqiad.wmnet with reboot policy FORCED [11:57:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:31] (CR) jerkins-bot: [V: -1] puppetdb::query_facts: try to optimize [puppet] - https://gerrit.wikimedia.org/r/787716 (owner: Jbond) [11:58:13] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:59:15] (PS2) Jbond: puppetdb::query_facts: try to optimize [puppet] - https://gerrit.wikimedia.org/r/787716 [12:00:00] (CR) jerkins-bot: [V: -1] puppetdb::query_facts: try to optimize [puppet] - https://gerrit.wikimedia.org/r/787716 (owner: Jbond) [12:01:43] (PS2) David Caro: Move from deprecated icinga_hosts to alerting_hosts [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/786960 (https://phabricator.wikimedia.org/T304533) [12:02:23] (CR) David Caro: Move from deprecated icinga_hosts to alerting_hosts (1 comment) [cookbooks] (wmcs) - 
https://gerrit.wikimedia.org/r/786960 (https://phabricator.wikimedia.org/T304533) (owner: David Caro) [12:02:37] (PS3) David Caro: Move from deprecated icinga_hosts to alerting_hosts [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/786960 (https://phabricator.wikimedia.org/T304533) [12:03:05] (PS1) MVernon: install_server install coreutils so we have stat available [puppet] - https://gerrit.wikimedia.org/r/787717 (https://phabricator.wikimedia.org/T300057) [12:03:59] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:04:04] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host gitlab1004.mgmt.eqiad.wmnet with reboot policy FORCED [12:04:08] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host gitlab-runner1002.mgmt.eqiad.wmnet with reboot policy FORCED [12:04:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:12] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host gitlab-runner1003.mgmt.eqiad.wmnet with reboot policy FORCED [12:04:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:21] PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:05:31] (PS1) Aqu: Update analytics refine job version in test cluster [puppet] - https://gerrit.wikimedia.org/r/787718 [12:09:08] SRE, ops-eqiad, DBA: db1164 fails to POST/boot/etc - https://phabricator.wikimedia.org/T307198 (Cmjohnson) This is the sign of a failed DIMM, during post it's failing during the checking memory phase. 
I attempted to reboot the system to "self-heal" but that failed, The SEL shows. I will request a D... [12:09:51] SRE, ops-eqiad, DC-Ops, serviceops, GitLab (Infrastructure): Q3:(Need By: TBD) rack/setup/install gitlab100[3|4] and gitlab-runner100[2|3|4] - https://phabricator.wikimedia.org/T301177 (Cmjohnson) [12:09:59] !log dcaro@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudnet2005-dev.codfw.wmnet with reason: host reimage [12:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:37] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host restbase-dev2001.codfw.wmnet [12:12:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:53] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudnet2005-dev.codfw.wmnet with reason: host reimage [12:12:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:40] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti-test2003.codfw.wmnet with OS bullseye [12:14:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:47] SRE: Upgrade ganeti-test to Bullseye - https://phabricator.wikimedia.org/T306499 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti-test2003.codfw.wmnet with OS bullseye [12:16:31] SRE: phedenskog uses the same SSH key(s) in WMCS and production - https://phabricator.wikimedia.org/T307079 (Peter) Thanks @fgiunchedi and @Dzahn I'll do that first thing next week. 
[12:16:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase-dev2001.codfw.wmnet [12:16:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:21] (PS3) Jbond: puppetdb::query_facts: try to optimize [puppet] - https://gerrit.wikimedia.org/r/787716 [12:21:28] (CR) Jbond: [C: +2] puppetdb::query_facts: try to optimize [puppet] - https://gerrit.wikimedia.org/r/787716 (owner: Jbond) [12:22:49] (CR) Muehlenhoff: [C: +1] "Looks good to me" [puppet] - https://gerrit.wikimedia.org/r/787717 (https://phabricator.wikimedia.org/T300057) (owner: MVernon) [12:23:53] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host restbase-dev2002.codfw.wmnet [12:23:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:16] (PS1) Jbond: puppetdb::query_facts: update return type [puppet] - https://gerrit.wikimedia.org/r/787719 [12:25:38] (CR) Jbond: [V: +2 C: +2] puppetdb::query_facts: update return type [puppet] - https://gerrit.wikimedia.org/r/787719 (owner: Jbond) [12:27:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase-dev2002.codfw.wmnet [12:27:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1147.eqiad.wmnet with reason: Maintenance [12:28:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1147.eqiad.wmnet with reason: Maintenance [12:28:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1147 (T306560)', diff saved to https://phabricator.wikimedia.org/P27010 and previous config saved to /var/cache/conftool/dbconfig/20220429-122805-ladsgroup.json [12:28:08] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:15] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [12:28:27] (CR) Alexandros Kosiaris: [C: -1] Remove the puppet lvm module (1 comment) [puppet] - https://gerrit.wikimedia.org/r/787708 (https://phabricator.wikimedia.org/T67270) (owner: Alexandros Kosiaris) [12:29:01] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti-test2003.codfw.wmnet with reason: host reimage [12:29:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:55] (CR) Filippo Giunchedi: [C: +1] "Two non blocking nits, LGTM" [puppet] - https://gerrit.wikimedia.org/r/787717 (https://phabricator.wikimedia.org/T300057) (owner: MVernon) [12:30:59] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudnet2005-dev.codfw.wmnet with OS bullseye [12:31:02] (CR) Alexandros Kosiaris: [C: -1] "Heh, I was just reminded of this as well. https://tickets.puppetlabs.com/browse/MODULES-5307" [puppet] - https://gerrit.wikimedia.org/r/787708 (https://phabricator.wikimedia.org/T67270) (owner: Alexandros Kosiaris) [12:31:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:09] SRE, ops-codfw, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install 7 wmcs hosts - https://phabricator.wikimedia.org/T304881 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dcaro@cumin1001 for host cloudnet2005-dev.codfw.wmnet wi... 
[12:32:20] (CR) Jbond: [C: -1] lvm::volume: add createonly flag and use in cinder backups (1 comment) [puppet] - https://gerrit.wikimedia.org/r/787527 (https://phabricator.wikimedia.org/T307117) (owner: David Caro) [12:32:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti-test2003.codfw.wmnet with reason: host reimage [12:32:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:29] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host restbase-dev2003.codfw.wmnet [12:33:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase-dev2003.codfw.wmnet [12:37:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:41] (CR) MVernon: install_server install coreutils so we have stat available (1 comment) [puppet] - https://gerrit.wikimedia.org/r/787717 (https://phabricator.wikimedia.org/T300057) (owner: MVernon) [12:40:44] SRE: Internal PKI for secure communication - Barcelona Ops offsite 2016 - https://phabricator.wikimedia.org/T150822 (fgiunchedi) [12:40:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T306560)', diff saved to https://phabricator.wikimedia.org/P27011 and previous config saved to /var/cache/conftool/dbconfig/20220429-124056-ladsgroup.json [12:40:58] SRE: Puppet CA rollover - https://phabricator.wikimedia.org/T150823 (fgiunchedi) Open→Invalid IIRC we did a "rollover" (extend certificate expiration, keeping the private key) for puppet CA, resolving this old task [12:41:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:03] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [12:41:06] (PS2) MVernon: install_server install coreutils so we have 
stat available [puppet] - https://gerrit.wikimedia.org/r/787717 (https://phabricator.wikimedia.org/T300057) [12:43:33] SRE: Run systematic 2FA availability tests - https://phabricator.wikimedia.org/T151049 (fgiunchedi) [12:44:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti-test2003.codfw.wmnet with OS bullseye [12:44:31] SRE: Upgrade ganeti-test to Bullseye - https://phabricator.wikimedia.org/T306499 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti-test2003.codfw.wmnet with OS bullseye completed: - ganeti-test2003 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled P... [12:44:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:56] SRE, ops-codfw, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install 7 wmcs hosts - https://phabricator.wikimedia.org/T304881 (dcaro) Adding a comment here for visibility and for the future :) the cloudnet hosts should have been prepared with 'insetup... 
[12:47:00] (CR) Filippo Giunchedi: [C: +1] install_server install coreutils so we have stat available (1 comment) [puppet] - https://gerrit.wikimedia.org/r/787717 (https://phabricator.wikimedia.org/T300057) (owner: MVernon) [12:47:17] SRE, ops-codfw, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install 7 wmcs hosts - https://phabricator.wikimedia.org/T304881 (dcaro) [12:47:51] SRE, ops-codfw, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install 7 wmcs hosts - https://phabricator.wikimedia.org/T304881 (dcaro) [12:48:40] (CR) Alexandros Kosiaris: [C: -2] lvm::volume: add createonly flag and use in cinder backups (1 comment) [puppet] - https://gerrit.wikimedia.org/r/787527 (https://phabricator.wikimedia.org/T307117) (owner: David Caro) [12:49:16] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4004.ulsfo.wmnet [12:49:17] SRE: Extending Yubico 2FA for production use (meta bug) - https://phabricator.wikimedia.org/T151045 (MoritzMuehlenhoff) Open→Declined This can be closed, the old Yubikey-specific setup has been shutdown, partly replaced by the CAS setup. 
[12:49:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:22] SRE: logrotate failing with $FILE.1.gz: File exists - https://phabricator.wikimedia.org/T151314 (fgiunchedi) Open→Declined I couldn't find any recent instance of this problem, boldly resolving [12:49:26] SRE, Patch-For-Review, Tracking-Neverending: Tracking and Reducing cron-spam to root@ - https://phabricator.wikimedia.org/T132324 (fgiunchedi) [12:49:52] (CR) David Caro: Remove the puppet lvm module (1 comment) [puppet] - https://gerrit.wikimedia.org/r/787708 (https://phabricator.wikimedia.org/T67270) (owner: Alexandros Kosiaris) [12:51:08] SRE, ops-esams, DC-Ops, Infrastructure-Foundations, and 2 others: decommission atlas-esams - https://phabricator.wikimedia.org/T307026 (ayounsi) I believe the anchors are linked to Faidon's RIPE account. @faidon, could you take care of it? [12:51:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance [12:51:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance [12:51:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T298295)', diff saved to https://phabricator.wikimedia.org/P27012 and previous config saved to /var/cache/conftool/dbconfig/20220429-125146-ladsgroup.json [12:51:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:56] T298295: Fix length of columns page_restrictions.pr_level/pr_type on wmf wikis - https://phabricator.wikimedia.org/T298295 [12:51:56] SRE: Run systematic 2FA availability tests - https://phabricator.wikimedia.org/T151049 (MoritzMuehlenhoff) Open→Declined This can be closed, the old 
Yubikey-specific setup has been shutdown [12:51:58] SRE: Extending Yubico 2FA for production use (meta bug) - https://phabricator.wikimedia.org/T151045 (MoritzMuehlenhoff) [12:52:27] SRE, Documentation: Proper documentation for Yubico 2FA for production use - https://phabricator.wikimedia.org/T151050 (MoritzMuehlenhoff) Open→Declined This can be closed, the old Yubikey-specific setup has been shutdown [12:52:29] SRE: Extending Yubico 2FA for production use (meta bug) - https://phabricator.wikimedia.org/T151045 (MoritzMuehlenhoff) [12:52:31] SRE: Fully puppetise yubikey-val - https://phabricator.wikimedia.org/T151046 (MoritzMuehlenhoff) Open→Declined This can be closed, the old Yubikey-specific setup has been shutdown [12:52:33] SRE: Extending Yubico 2FA for production use (meta bug) - https://phabricator.wikimedia.org/T151045 (MoritzMuehlenhoff) [12:52:45] SRE: Extending Yubico 2FA for production use (meta bug) - https://phabricator.wikimedia.org/T151045 (MoritzMuehlenhoff) [12:53:07] SRE: Integrate Yubikey into data.yaml - https://phabricator.wikimedia.org/T151047 (MoritzMuehlenhoff) Open→Declined This can be closed, the old Yubikey-specific setup has been shutdown [12:53:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4004.ulsfo.wmnet [12:53:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P27013 and previous config saved to /var/cache/conftool/dbconfig/20220429-125601-ladsgroup.json [12:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:17] (CR) David Caro: lvm::volume: add createonly flag and use in cinder backups (1 comment) [puppet] - https://gerrit.wikimedia.org/r/787527 (https://phabricator.wikimedia.org/T307117) (owner: David Caro) [12:58:05] 
(CR) Muehlenhoff: "On a related note: The current plan is to no longer apply a default license to puppet.git at large, but rather approach this on a per modu" [puppet] - https://gerrit.wikimedia.org/r/787527 (https://phabricator.wikimedia.org/T307117) (owner: David Caro) [13:00:38] (CR) Alexandros Kosiaris: lvm::volume: add createonly flag and use in cinder backups (1 comment) [puppet] - https://gerrit.wikimedia.org/r/787527 (https://phabricator.wikimedia.org/T307117) (owner: David Caro) [13:01:35] SRE: stat user crontab on stat hosts for old file removal - https://phabricator.wikimedia.org/T151317 (fgiunchedi) Open→Declined It looks like we got rid of these crons over time, resolving but let's reopen if this comes back [13:01:39] SRE, Patch-For-Review, Tracking-Neverending: Tracking and Reducing cron-spam to root@ - https://phabricator.wikimedia.org/T132324 (fgiunchedi) [13:05:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T298295)', diff saved to https://phabricator.wikimedia.org/P27014 and previous config saved to /var/cache/conftool/dbconfig/20220429-130556-ladsgroup.json [13:06:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:03] T298295: Fix length of columns page_restrictions.pr_level/pr_type on wmf wikis - https://phabricator.wikimedia.org/T298295 [13:06:50] SRE, serviceops: Provide node14 images for running production node-based services - https://phabricator.wikimedia.org/T306996 (Esanders) > Within days of an LTS reaching EOL major nodejs libraries will be looking to remove support for it from their releases. Indeed, and many don't even wait for the EOL... 
[13:07:48] SRE: Add Prometheus collector for Tor - https://phabricator.wikimedia.org/T188098 (fgiunchedi) Open→Declined We're not running a Tor relay at least since Ib59edbb8e, resolving [13:09:22] PROBLEM - Check systemd state on ms-be1033 is CRITICAL: CRITICAL - degraded: The following units failed: session-326228.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:11:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P27015 and previous config saved to /var/cache/conftool/dbconfig/20220429-131106-ladsgroup.json [13:11:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:15] SRE, serviceops: Provide node14 images for running production node-based services - https://phabricator.wikimedia.org/T306996 (MoritzMuehlenhoff) >> We can import the nodesource packages into separate repository components, e.g. thirdparty/node14 and thirdparty/node16. This way applications have the fle... 
[13:18:46] (CR) Jbond: [C: +1] "+1 as we are all agreeing and the actual Cr looks fine" [puppet] - https://gerrit.wikimedia.org/r/787527 (https://phabricator.wikimedia.org/T307117) (owner: David Caro) [13:21:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P27016 and previous config saved to /var/cache/conftool/dbconfig/20220429-132101-ladsgroup.json [13:21:02] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2040.codfw.wmnet [13:21:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:16] RECOVERY - Check systemd state on ms-be2040 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:25:54] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2040.codfw.wmnet [13:25:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T306560)', diff saved to https://phabricator.wikimedia.org/P27017 and previous config saved to /var/cache/conftool/dbconfig/20220429-132611-ladsgroup.json [13:26:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1138.eqiad.wmnet with reason: Maintenance [13:26:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1138.eqiad.wmnet with reason: Maintenance [13:26:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:18] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [13:26:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1138 (T306560)', diff saved to https://phabricator.wikimedia.org/P27018 and 
previous config saved to /var/cache/conftool/dbconfig/20220429-132619-ladsgroup.json [13:26:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:55] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2040.codfw.wmnet [13:27:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:25] (CR) David Caro: [C: +2] lvm::volume: add createonly flag and use in cinder backups [puppet] - https://gerrit.wikimedia.org/r/787527 (https://phabricator.wikimedia.org/T307117) (owner: David Caro) [13:31:53] (PS5) David Caro: lvm::volume: add createonly flag and use in cinder backups [puppet] - https://gerrit.wikimedia.org/r/787527 (https://phabricator.wikimedia.org/T307117) [13:32:44] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2040.codfw.wmnet [13:32:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:13] ACKNOWLEDGEMENT - Router interfaces on cr3-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: ayounsi T307121 - The acknowledgement expires at: 2022-05-04 13:32:47. 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:35:05] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2003.codfw.wmnet [13:35:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P27019 and previous config saved to /var/cache/conftool/dbconfig/20220429-133606-ladsgroup.json [13:36:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:56] (PS18) Herron: prometheus: enable prometheus web access via proxy with IDP [puppet] - https://gerrit.wikimedia.org/r/764895 (https://phabricator.wikimedia.org/T301944) [13:37:47] (CR) Herron: "Thanks for the reviews! I think we're in a good place to try a deploy so I'll plan to do that early next week" [puppet] - https://gerrit.wikimedia.org/r/764895 (https://phabricator.wikimedia.org/T301944) (owner: Herron) [13:41:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2003.codfw.wmnet [13:41:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:11] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti-test2003.codfw.wmnet to ganeti-test01.svc.codfw.wmnet [13:42:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti-test2003.codfw.wmnet to ganeti-test01.svc.codfw.wmnet [13:43:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:46] SRE: Update prometheus-varnish-exporter on debian to 1.4 - https://phabricator.wikimedia.org/T195252 (fgiunchedi) Open→Invalid Debian ships with prometheus-varnish-exporter 1.6 nowadays [13:46:49] (CR) MVernon: install_server install coreutils so we have 
stat available (1 comment) [puppet] - https://gerrit.wikimedia.org/r/787717 (https://phabricator.wikimedia.org/T300057) (owner: MVernon) [13:48:14] SRE: Upgrade Ganeti clusters to 2.15.2-7+deb9u3 - https://phabricator.wikimedia.org/T210289 (fgiunchedi) Open→Invalid Minimum ganeti version on the fleet is `2.16.0-5` nowadays, resolving [13:49:03] (PS1) JMeybohm: Add -ro and -rw discovery records for k8s-ingress-wikikube [dns] - https://gerrit.wikimedia.org/r/787747 (https://phabricator.wikimedia.org/T305358) [13:49:10] (CR) Filippo Giunchedi: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/787717 (https://phabricator.wikimedia.org/T300057) (owner: MVernon) [13:49:45] (PS1) JMeybohm: Add -ro and -rw discovery records for k8s-ingress-wikikube [puppet] - https://gerrit.wikimedia.org/r/787748 (https://phabricator.wikimedia.org/T305358) [13:50:35] (CR) MVernon: [C: +2] install_server install coreutils so we have stat available [puppet] - https://gerrit.wikimedia.org/r/787717 (https://phabricator.wikimedia.org/T300057) (owner: MVernon) [13:51:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T298295)', diff saved to https://phabricator.wikimedia.org/P27020 and previous config saved to /var/cache/conftool/dbconfig/20220429-135111-ladsgroup.json [13:51:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [13:51:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [13:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T298295)', diff saved to https://phabricator.wikimedia.org/P27021 and previous config saved to /var/cache/conftool/dbconfig/20220429-135121-ladsgroup.json [13:51:22] T298295: Fix length of columns 
page_restrictions.pr_level/pr_type on wmf wikis - https://phabricator.wikimedia.org/T298295 [13:51:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:11] (PS1) Luke Bowmaker: Image Suggestions Feedback [mediawiki-config] - https://gerrit.wikimedia.org/r/787749 [13:52:40] (CR) jerkins-bot: [V: -1] Image Suggestions Feedback [mediawiki-config] - https://gerrit.wikimedia.org/r/787749 (owner: Luke Bowmaker) [13:53:25] SRE: /dev/log symlink to /run/systemd/journal/dev-log disappeared on kubernetes1001 - https://phabricator.wikimedia.org/T212681 (fgiunchedi) Open→Declined I couldn't find any more occurrences of this, tentatively resolving [13:53:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T298295)', diff saved to https://phabricator.wikimedia.org/P27022 and previous config saved to /var/cache/conftool/dbconfig/20220429-135329-ladsgroup.json [13:53:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T306560)', diff saved to https://phabricator.wikimedia.org/P27023 and previous config saved to /var/cache/conftool/dbconfig/20220429-135511-ladsgroup.json [13:55:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:19] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [13:55:45] SRE: Create cookbook to add a node to a Ganeti cluster - https://phabricator.wikimedia.org/T274527 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff The cookbook has been written and is in active use for several months, various fine-tuning has been done to it (and will continue to get applied... 
[13:55:47] SRE: Cookbooks for Ganeti maintenance tasks - https://phabricator.wikimedia.org/T283319 (MoritzMuehlenhoff) [13:56:25] (PS2) Luke Bowmaker: Image Suggestions Feedback Stream [mediawiki-config] - https://gerrit.wikimedia.org/r/787749 [13:56:50] SRE, WMF-JobQueue, serviceops, Sustainability (Incident Followup): Videoscalers fail health checks while CPU is maxed - https://phabricator.wikimedia.org/T306860 (akosiaris) > As a starting point: @jhathaway noted that we're running ffmpeg at niceness -19, which is quite assertive; raising that v... [13:56:50] !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be2041.codfw.wmnet with OS bullseye [13:56:53] (CR) jerkins-bot: [V: -1] Image Suggestions Feedback Stream [mediawiki-config] - https://gerrit.wikimedia.org/r/787749 (owner: Luke Bowmaker) [13:56:54] SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host ms-be2041.codfw.wmnet with OS bullseye [13:56:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:19] (CR) Alexandros Kosiaris: [C: +1] Add -ro and -rw discovery records for k8s-ingress-wikikube [dns] - https://gerrit.wikimedia.org/r/787747 (https://phabricator.wikimedia.org/T305358) (owner: JMeybohm) [13:57:24] SRE, DC-Ops: fix IPMI over LAN on certain HP hosts - https://phabricator.wikimedia.org/T235234 (fgiunchedi) Open→Resolved a:fgiunchedi @Dzahn @Papaul I'm tentatively resolving this old task, most/all hosts mentioned here have been decom'd [13:57:26] SRE: IPMI Audit 2018-04 - https://phabricator.wikimedia.org/T193155 (fgiunchedi) [13:57:29] SRE, Documentation: Document how to fix IPMI issues on Wikitech - https://phabricator.wikimedia.org/T191956 (fgiunchedi) [13:57:33] SRE, observability: Remote IPMI doesn't 
work for ~2% of the fleet - https://phabricator.wikimedia.org/T150160 (fgiunchedi) [13:57:49] (PS3) Luke Bowmaker: Image Suggestions Feedback Stream [mediawiki-config] - https://gerrit.wikimedia.org/r/787749 [13:57:56] (CR) Alexandros Kosiaris: [C: +1] Add -ro and -rw discovery records for k8s-ingress-wikikube [puppet] - https://gerrit.wikimedia.org/r/787748 (https://phabricator.wikimedia.org/T305358) (owner: JMeybohm) [13:58:38] (CR) jerkins-bot: [V: -1] Image Suggestions Feedback Stream [mediawiki-config] - https://gerrit.wikimedia.org/r/787749 (owner: Luke Bowmaker) [14:00:51] (CR) Alexandros Kosiaris: [C: -1] "fleet wide PCC output indeed identified just cloudbackups breaking. I 'll try to fix the cloud backup stuff in a separate change." [puppet] - https://gerrit.wikimedia.org/r/787708 (https://phabricator.wikimedia.org/T67270) (owner: Alexandros Kosiaris) [14:01:45] (CR) JMeybohm: [C: +2] Add -ro and -rw discovery records for k8s-ingress-wikikube [puppet] - https://gerrit.wikimedia.org/r/787748 (https://phabricator.wikimedia.org/T305358) (owner: JMeybohm) [14:02:10] SRE, ops-eqiad, DBA: db1164 fails to POST/boot/etc - https://phabricator.wikimedia.org/T307198 (fgiunchedi) p:Triage→Medium [14:02:19] (PS4) Luke Bowmaker: Image Suggestions Feedback Stream [mediawiki-config] - https://gerrit.wikimedia.org/r/787749 [14:03:08] (CR) jerkins-bot: [V: -1] Image Suggestions Feedback Stream [mediawiki-config] - https://gerrit.wikimedia.org/r/787749 (owner: Luke Bowmaker) [14:04:06] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:05:44] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:06:13] SRE, Security: Disable agent forwarding to important hosts - 
https://phabricator.wikimedia.org/T198138 (fgiunchedi) Open→Resolved a:fgiunchedi AFAICS we're disabling agent forwarding across the board (production and wmcs), resolving [14:08:26] SRE, Security: Disable agent forwarding to important hosts - https://phabricator.wikimedia.org/T198138 (MoritzMuehlenhoff) Resolved→Open No, it's still enabled in Cloud VPS. [14:08:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P27024 and previous config saved to /var/cache/conftool/dbconfig/20220429-140834-ladsgroup.json [14:08:38] (PS1) JMeybohm: Add desired state for k8s-ingress-wikikube -ro and -rw discovery records [puppet] - https://gerrit.wikimedia.org/r/787750 (https://phabricator.wikimedia.org/T305358) [14:08:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P27025 and previous config saved to /var/cache/conftool/dbconfig/20220429-141017-ladsgroup.json [14:10:22] (CR) JMeybohm: [C: +2] Add -ro and -rw discovery records for k8s-ingress-wikikube [dns] - https://gerrit.wikimedia.org/r/787747 (https://phabricator.wikimedia.org/T305358) (owner: JMeybohm) [14:10:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:18] SRE, Data-Engineering, Data-Engineering-Kanban, Traffic: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (BTullis) I am now investigating by capturing network traffic from the eventgate-analytics-external pods and looking... 
[14:16:20] !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2041.codfw.wmnet with reason: host reimage [14:16:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [14:16:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [14:16:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T306560)', diff saved to https://phabricator.wikimedia.org/P27026 and previous config saved to /var/cache/conftool/dbconfig/20220429-141633-ladsgroup.json [14:16:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:43] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [14:17:36] SRE, ops-esams, DC-Ops, Infrastructure-Foundations, and 2 others: decommission atlas-esams - https://phabricator.wikimedia.org/T307026 (cmooney) From the experience with the one in codfw I think the process is to delete and then re-add. If @faidon can remove our existing one we can take care of... 
[14:19:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T306560)', diff saved to https://phabricator.wikimedia.org/P27027 and previous config saved to /var/cache/conftool/dbconfig/20220429-141902-ladsgroup.json [14:19:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:09] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2041.codfw.wmnet with reason: host reimage [14:19:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:18] (CR) JMeybohm: [C: +2] Add desired state for k8s-ingress-wikikube -ro and -rw discovery records [puppet] - https://gerrit.wikimedia.org/r/787750 (https://phabricator.wikimedia.org/T305358) (owner: JMeybohm) [14:21:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [14:21:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [14:21:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [14:21:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [14:21:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T298565)', diff saved to https://phabricator.wikimedia.org/P27028 and previous config saved to /var/cache/conftool/dbconfig/20220429-142142-ladsgroup.json [14:21:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:50] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:56] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [14:23:38] PROBLEM - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on authdns1001 is CRITICAL: File not found: /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:23:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P27029 and previous config saved to /var/cache/conftool/dbconfig/20220429-142339-ladsgroup.json [14:23:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:54] PROBLEM - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns2002 is CRITICAL: File not found: /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:23:56] PROBLEM - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-ro.state on dns4002 is CRITICAL: File not found: /var/lib/gdnsd/discovery-k8s-ingress-wikikube-ro.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:25:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P27030 and previous config saved to /var/cache/conftool/dbconfig/20220429-142523-ladsgroup.json [14:25:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:49] confd template stuff is me [14:26:26] PROBLEM - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns4002 is CRITICAL: 
File not found: /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:28:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T298565)', diff saved to https://phabricator.wikimedia.org/P27031 and previous config saved to /var/cache/conftool/dbconfig/20220429-142806-ladsgroup.json [14:28:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:13] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [14:29:08] PROBLEM - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-ro.state on dns1001 is CRITICAL: File not found: /var/lib/gdnsd/discovery-k8s-ingress-wikikube-ro.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:29:10] PROBLEM - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-ro.state on dns2001 is CRITICAL: File not found: /var/lib/gdnsd/discovery-k8s-ingress-wikikube-ro.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:29:42] !log jayme@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=k8s-ingress-wikikube-ro [14:29:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:38] ops-drmrs, ops-esams, Infrastructure-Foundations, netops: drmrs-esams wave provisioning - https://phabricator.wikimedia.org/T307221 (ayounsi) [14:31:44] PROBLEM - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns1001 is CRITICAL: File not found: /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:31:46] PROBLEM - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns2001 is CRITICAL: File not found: 
/var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:31:46] PROBLEM - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-ro.state on dns4001 is CRITICAL: File not found: /var/lib/gdnsd/discovery-k8s-ingress-wikikube-ro.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:31:57] !log jayme@cumin1001 START - Cookbook sre.dns.netbox [14:32:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:27] ops-drmrs, ops-esams, Infrastructure-Foundations, netops: drmrs-esams wave provisioning - https://phabricator.wikimedia.org/T307221 (ayounsi) [14:32:33] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2041.codfw.wmnet with OS bullseye [14:32:36] SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host ms-be2041.codfw.wmnet with OS bullseye completed: - ms-be2041 (**PASS**) - Downtim... 
[14:32:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P27032 and previous config saved to /var/cache/conftool/dbconfig/20220429-143407-ladsgroup.json [14:34:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:32] PROBLEM - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns4001 is CRITICAL: File not found: /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:34:32] PROBLEM - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-ro.state on dns6001 is CRITICAL: File not found: /var/lib/gdnsd/discovery-k8s-ingress-wikikube-ro.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:35:04] !log jayme@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:35:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:33] SRE, SRE-swift-storage, Patch-For-Review: 'swift' user/group IDs should be consistent across the fleet - https://phabricator.wikimedia.org/T123918 (MatthewVernon) [14:36:46] PROBLEM - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-ro.state on authdns2001 is CRITICAL: File not found: /var/lib/gdnsd/discovery-k8s-ingress-wikikube-ro.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:37:01] SRE, SRE-swift-storage: reimaging swift backends should set swift UID/GID to match filesystems - https://phabricator.wikimedia.org/T300057 (MatthewVernon) Open→Resolved Revised version (using coreutils' `stat` from `/target`) worked with ms-be2041. 
[14:37:08] PROBLEM - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns6001 is CRITICAL: File not found: /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:37:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:38:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T298295)', diff saved to https://phabricator.wikimedia.org/P27033 and previous config saved to /var/cache/conftool/dbconfig/20220429-143844-ladsgroup.json [14:38:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance [14:38:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance [14:38:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [14:38:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:51] T298295: Fix length of columns page_restrictions.pr_level/pr_type on wmf wikis - https://phabricator.wikimedia.org/T298295 [14:38:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [14:38:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T298295)', diff saved to https://phabricator.wikimedia.org/P27034 and previous config saved to /var/cache/conftool/dbconfig/20220429-143857-ladsgroup.json [14:39:00] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:00] SRE: DNS repo: add Jenkins job to ensure there are no duplicates - https://phabricator.wikimedia.org/T155761 (fgiunchedi) AFAICT nowadays `zone_validator.py` will fail on duplicate records. Ok to resolve this @Volans ? [14:39:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:24] PROBLEM - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on authdns2001 is CRITICAL: File not found: /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:39:40] PROBLEM - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-ro.state on dns1002 is CRITICAL: File not found: /var/lib/gdnsd/discovery-k8s-ingress-wikikube-ro.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:39:42] PROBLEM - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-ro.state on dns6002 is CRITICAL: File not found: /var/lib/gdnsd/discovery-k8s-ingress-wikikube-ro.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:40:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T306560)', diff saved to https://phabricator.wikimedia.org/P27035 and previous config saved to /var/cache/conftool/dbconfig/20220429-144028-ladsgroup.json [14:40:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [14:40:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [14:40:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:36] T306560: Fix nullability of img_major_mime and oi_major_mime -
https://phabricator.wikimedia.org/T306560 [14:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:58] (PS1) JMeybohm: Add -ro and -rw discovery records for k8s-ingress-wikikube [puppet] - https://gerrit.wikimedia.org/r/787752 (https://phabricator.wikimedia.org/T305358) [14:41:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T298295)', diff saved to https://phabricator.wikimedia.org/P27036 and previous config saved to /var/cache/conftool/dbconfig/20220429-144105-ladsgroup.json [14:41:08] SRE, ops-esams: wipe backup-array1 - https://phabricator.wikimedia.org/T237041 (fgiunchedi) @Papaul @robh I couldn't find any trace of this host in netbox, has it been decom'd and thus we can close the task ? [14:41:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:55] (CR) JMeybohm: [C: +2] Add -ro and -rw discovery records for k8s-ingress-wikikube [puppet] - https://gerrit.wikimedia.org/r/787752 (https://phabricator.wikimedia.org/T305358) (owner: JMeybohm) [14:42:12] PROBLEM - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns1002 is CRITICAL: File not found: /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:42:14] PROBLEM - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-ro.state on dns3002 is CRITICAL: File not found: /var/lib/gdnsd/discovery-k8s-ingress-wikikube-ro.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:42:16] PROBLEM - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns6002 is CRITICAL: File not found: /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:42:34] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4:
Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:42:52] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:43:00] RECOVERY - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on authdns2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:43:06] RECOVERY - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns4001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:43:06] RECOVERY - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-ro.state on dns6001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:43:08] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:43:10] !log jayme@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=k8s-ingress-wikikube-ro [14:43:10] RECOVERY - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on authdns1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:43:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P27037 and previous config saved to /var/cache/conftool/dbconfig/20220429-144311-ladsgroup.json [14:43:12] RECOVERY - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns2002 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:43:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:16] RECOVERY - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns4002 is OK: No 
errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:43:16] RECOVERY - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-ro.state on dns4002 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:43:16] RECOVERY - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-ro.state on dns1002 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:43:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:22] RECOVERY - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-ro.state on dns6002 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:43:22] RECOVERY - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:43:26] RECOVERY - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:43:28] RECOVERY - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-ro.state on dns4001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:43:34] RECOVERY - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-ro.state on dns1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:43:34] !log jayme@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=k8s-ingress-wikikube-rw,name=eqiad [14:43:36] RECOVERY - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-ro.state on dns2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:43:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:56] RECOVERY - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-ro.state on authdns2001 is OK: No errors 
detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:44:10] RECOVERY - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns1002 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:44:12] RECOVERY - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-ro.state on dns3002 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:44:12] RECOVERY - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns6002 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:44:20] RECOVERY - Confd template for /var/lib/gdnsd/discovery-k8s-ingress-wikikube-rw.state on dns6001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:44:31] sorry... [14:44:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [14:44:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [14:44:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:09] no worries jayme, it happens [14:48:23] SRE, ops-eqiad, Cassandra, DC-Ops, User-Eevans: Relocate hosts: aqs10[3-5] - https://phabricator.wikimedia.org/T307035 (Eevans) @Cmjohnson if we wait until T305570 is complete, we should have the capacity to do it whenever is convenient -and provided we do it one machine at a time- without an...
[14:49:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P27038 and previous config saved to /var/cache/conftool/dbconfig/20220429-144912-ladsgroup.json [14:49:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [14:49:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [14:49:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T306560)', diff saved to https://phabricator.wikimedia.org/P27039 and previous config saved to /var/cache/conftool/dbconfig/20220429-144947-ladsgroup.json [14:49:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:57] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [14:50:37] SRE: Upgrade ganeti-test to Bullseye - https://phabricator.wikimedia.org/T306499 (MoritzMuehlenhoff) [14:51:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [14:51:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [14:51:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3314 (T306560)', diff saved to https://phabricator.wikimedia.org/P27040 and previous config saved to /var/cache/conftool/dbconfig/20220429-145148-ladsgroup.json [14:51:51] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:34] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host gitlab-runner1004.mgmt.eqiad.wmnet with reboot policy FORCED [14:52:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:57] SRE: Upgrade ganeti-test to Bullseye - https://phabricator.wikimedia.org/T306499 (MoritzMuehlenhoff) The following upgrade steps were done in the Ganeti test cluster for the 3.0 update: We'll be keeping the "kvm:machine_version=pc-i440fx-2.8" KVM machine type which was applied as part of the buster update f... [14:56:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P27041 and previous config saved to /var/cache/conftool/dbconfig/20220429-145610-ladsgroup.json [14:56:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T306560)', diff saved to https://phabricator.wikimedia.org/P27042 and previous config saved to /var/cache/conftool/dbconfig/20220429-145730-ladsgroup.json [14:57:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:37] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [14:58:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P27043 and previous config saved to /var/cache/conftool/dbconfig/20220429-145816-ladsgroup.json [14:58:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile -
https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:02:15] (PS1) JMeybohm: Switch miscweb and datahub-gms to new discovery records [dns] - https://gerrit.wikimedia.org/r/787753 (https://phabricator.wikimedia.org/T305358) [15:04:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T306560)', diff saved to https://phabricator.wikimedia.org/P27044 and previous config saved to /var/cache/conftool/dbconfig/20220429-150417-ladsgroup.json [15:04:21] (PS1) Stang: zhwiki: Update zh-hans version tagline and wordmark files [mediawiki-config] - https://gerrit.wikimedia.org/r/787754 (https://phabricator.wikimedia.org/T276694) [15:04:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:24] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [15:04:41] SRE, Traffic-Icebox: Create a second text-lb IP address for test purposes - https://phabricator.wikimedia.org/T237492 (fgiunchedi) We have `test-lb` nowadays, ok to resolve @BBlack or there's sth missing ?
[15:04:46] (CR) JMeybohm: [C: +2] Switch miscweb and datahub-gms to new discovery records [dns] - https://gerrit.wikimedia.org/r/787753 (https://phabricator.wikimedia.org/T305358) (owner: JMeybohm) [15:06:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T306560)', diff saved to https://phabricator.wikimedia.org/P27045 and previous config saved to /var/cache/conftool/dbconfig/20220429-150619-ladsgroup.json [15:06:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:21] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host gitlab-runner1004.mgmt.eqiad.wmnet with reboot policy FORCED [15:07:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:35] !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be2042.codfw.wmnet with OS bullseye [15:07:38] SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host ms-be2042.codfw.wmnet with OS bullseye [15:07:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P27046 and previous config saved to /var/cache/conftool/dbconfig/20220429-151115-ladsgroup.json [15:11:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:50] (PS1) JMeybohm: Remove k8s-ingress-wikikube.discovery.wmnet [dns] - https://gerrit.wikimedia.org/r/787756 (https://phabricator.wikimedia.org/T305358) [15:12:09] (PS1) Cwhite: Revert "beta-logs: temporarily undefine cluster jobs_host" [puppet] - https://gerrit.wikimedia.org/r/787773 [15:12:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after
maintenance db1166', diff saved to https://phabricator.wikimedia.org/P27047 and previous config saved to /var/cache/conftool/dbconfig/20220429-151235-ladsgroup.json [15:12:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:59] SRE, ops-esams: wipe backup-array1 - https://phabricator.wikimedia.org/T237041 (fgiunchedi) Open→Resolved a:fgiunchedi Resolving as per @Papaul this host is no longer ` 15:06 godog: hey looks like you are doing some tasks clean up thanks for that for T237041 i thin... [15:13:21] (CR) Cwhite: [C: +2] Revert "beta-logs: temporarily undefine cluster jobs_host" [puppet] - https://gerrit.wikimedia.org/r/787773 (owner: Cwhite) [15:13:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T298565)', diff saved to https://phabricator.wikimedia.org/P27048 and previous config saved to /var/cache/conftool/dbconfig/20220429-151321-ladsgroup.json [15:13:25] SRE, ops-eqiad, DC-Ops, serviceops, GitLab (Infrastructure): Q3:(Need By: TBD) rack/setup/install gitlab100[3|4] and gitlab-runner100[2|3|4] - https://phabricator.wikimedia.org/T301177 (Cmjohnson) >>! In T301177#7886110, @Dzahn wrote: > confirming that the "gitlab" hosts should use a public I...
[15:13:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:28] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [15:13:46] (PS3) Cwhite: opensearch: ensure curator is >=5.8.1 [puppet] - https://gerrit.wikimedia.org/r/787109 (https://phabricator.wikimedia.org/T301017) [15:14:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [15:14:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [15:14:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [15:14:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [15:14:21] (CR) Cwhite: opensearch: ensure curator is >=5.8.1 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/787109 (https://phabricator.wikimedia.org/T301017) (owner: Cwhite) [15:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T298565)', diff saved to https://phabricator.wikimedia.org/P27049 and previous config saved to /var/cache/conftool/dbconfig/20220429-151424-ladsgroup.json [15:14:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:37] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:58] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudcontrol100[6-7].wikimedia.org - https://phabricator.wikimedia.org/T306853 (ayounsi) @Andrew, thanks, I'm still in my quest of reducing our public vlans usage ;) Could those hosts use private IPs (li... [15:15:02] (PS2) Stang: zhwiki: Update zh-hans version tagline and wordmark files [mediawiki-config] - https://gerrit.wikimedia.org/r/787754 (https://phabricator.wikimedia.org/T276694) [15:15:08] (CR) Filippo Giunchedi: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/787109 (https://phabricator.wikimedia.org/T301017) (owner: Cwhite) [15:16:34] (PS1) JMeybohm: Remove k8s-ingress-wikikube.discovery.wmnet [puppet] - https://gerrit.wikimedia.org/r/787757 (https://phabricator.wikimedia.org/T305358) [15:16:44] (PS5) Luke Bowmaker: Image Suggestions Feedback Stream [mediawiki-config] - https://gerrit.wikimedia.org/r/787749 [15:19:13] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (ayounsi) a:ayounsi→None [15:20:16] (PS1) Cmjohnson: adding gitlab and gitlab runner hosts to site.pp [puppet] - https://gerrit.wikimedia.org/r/787758 (https://phabricator.wikimedia.org/T301177) [15:20:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T298565)', diff saved to https://phabricator.wikimedia.org/P27050 and previous config saved to /var/cache/conftool/dbconfig/20220429-152046-ladsgroup.json [15:20:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:55] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword,
user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [15:21:09] (CR) Cmjohnson: [C: +2] adding gitlab and gitlab runner hosts to site.pp [puppet] - https://gerrit.wikimedia.org/r/787758 (https://phabricator.wikimedia.org/T301177) (owner: Cmjohnson) [15:21:22] (CR) Luke Bowmaker: "Hi Andrew," [mediawiki-config] - https://gerrit.wikimedia.org/r/787749 (owner: Luke Bowmaker) [15:21:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P27051 and previous config saved to /var/cache/conftool/dbconfig/20220429-152124-ladsgroup.json [15:21:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:49] !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2042.codfw.wmnet with reason: host reimage [15:21:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:41] SRE, ops-eqiad, DC-Ops, serviceops, and 2 others: Q3:(Need By: TBD) rack/setup/install gitlab100[3|4] and gitlab-runner100[2|3|4] - https://phabricator.wikimedia.org/T301177 (Cmjohnson) [15:22:45] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:22:48] (PS2) JMeybohm: Remove k8s-ingress-wikikube.discovery.wmnet [puppet] - https://gerrit.wikimedia.org/r/787757 (https://phabricator.wikimedia.org/T305358) [15:22:50] (PS1) JMeybohm: trafficserver: Switch datahub to new k8s-ingress-wikikube discovery [puppet] - https://gerrit.wikimedia.org/r/787759 (https://phabricator.wikimedia.org/T305358) [15:24:59] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [15:25:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:07] !log mvernon@cumin1001 END (PASS) -
Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2042.codfw.wmnet with reason: host reimage [15:25:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:27] (CR) Alexandros Kosiaris: [C: +1] docker: ensure apparmor package is installed if on bullseye [puppet] - https://gerrit.wikimedia.org/r/785226 (owner: Dzahn) [15:26:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T298295)', diff saved to https://phabricator.wikimedia.org/P27052 and previous config saved to /var/cache/conftool/dbconfig/20220429-152620-ladsgroup.json [15:26:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance [15:26:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance [15:26:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:27] T298295: Fix length of columns page_restrictions.pr_level/pr_type on wmf wikis - https://phabricator.wikimedia.org/T298295 [15:26:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1131 (T298295)', diff saved to https://phabricator.wikimedia.org/P27053 and previous config saved to /var/cache/conftool/dbconfig/20220429-152628-ladsgroup.json [15:26:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:13] (CR) Alexandros Kosiaris: [C: +1] helmfile.d: add developer-portal [deployment-charts] - https://gerrit.wikimedia.org/r/773995 (https://phabricator.wikimedia.org/T297140) (owner: Majavah) [15:27:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P27054 and previous config saved
to /var/cache/conftool/dbconfig/20220429-152740-ladsgroup.json [15:27:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:35] (CR) JMeybohm: [C: +2] trafficserver: Switch datahub to new k8s-ingress-wikikube discovery [puppet] - https://gerrit.wikimedia.org/r/787759 (https://phabricator.wikimedia.org/T305358) (owner: JMeybohm) [15:28:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T298295)', diff saved to https://phabricator.wikimedia.org/P27055 and previous config saved to /var/cache/conftool/dbconfig/20220429-152836-ladsgroup.json [15:28:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:42] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:06] !log update NIC firmware for backup1002 T286722 T305446 [15:29:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:13] T305446: Upgrade backup* hosts to bullseye - https://phabricator.wikimedia.org/T305446 [15:29:13] T286722: Broadcom BCM57412 10G NIC and Bullseye installer - https://phabricator.wikimedia.org/T286722 [15:31:03] (CR) Alexandros Kosiaris: "I fear we don't have the cycles to do a more thorough review of this one right now.
Given the hackathon timeline we probably want to move " [deployment-charts] - https://gerrit.wikimedia.org/r/773994 (https://phabricator.wikimedia.org/T297140) (owner: Majavah) [15:31:10] (CR) Cwhite: [C: +2] opensearch: ensure curator is >=5.8.1 [puppet] - https://gerrit.wikimedia.org/r/787109 (https://phabricator.wikimedia.org/T301017) (owner: Cwhite) [15:31:35] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [15:31:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:40] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:34:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P27056 and previous config saved to /var/cache/conftool/dbconfig/20220429-153551-ladsgroup.json [15:35:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P27057 and previous config saved to /var/cache/conftool/dbconfig/20220429-153629-ladsgroup.json [15:36:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:35] !log jynus@cumin1001 START - Cookbook sre.hosts.reimage for host backup1002.eqiad.wmnet with OS bullseye [15:36:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:21] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [15:37:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:46] (PS1) Cwhite: Revert "opensearch: ensure curator is >=5.8.1" [puppet] - https://gerrit.wikimedia.org/r/787774 [15:38:37] (CR) Cwhite: [C: +2] Revert "opensearch: ensure curator is >=5.8.1" [puppet] - https://gerrit.wikimedia.org/r/787774 (owner: Cwhite)
[15:39:37] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [15:39:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:20] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [15:41:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:45] SRE, ops-eqiad, DBA: db1164 fails to POST/boot/etc - https://phabricator.wikimedia.org/T307198 (Kormat) I've set the host to 'failed' in netbox: https://netbox.wikimedia.org/dcim/devices/2999/ [15:42:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T306560)', diff saved to https://phabricator.wikimedia.org/P27058 and previous config saved to /var/cache/conftool/dbconfig/20220429-154245-ladsgroup.json [15:42:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [15:42:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [15:42:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:52] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [15:42:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T306560)', diff saved to https://phabricator.wikimedia.org/P27059 and previous config saved to /var/cache/conftool/dbconfig/20220429-154253-ladsgroup.json [15:42:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P27060 and previous config saved to
/var/cache/conftool/dbconfig/20220429-154341-ladsgroup.json [15:43:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:15] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:45:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:50:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P27061 and previous config saved to /var/cache/conftool/dbconfig/20220429-155057-ladsgroup.json [15:51:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T306560)', diff saved to https://phabricator.wikimedia.org/P27062 and previous config saved to /var/cache/conftool/dbconfig/20220429-155134-ladsgroup.json [15:51:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [15:51:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [15:51:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:41] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [15:51:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3314 (T306560)', diff saved to https://phabricator.wikimedia.org/P27063 and previous config saved to /var/cache/conftool/dbconfig/20220429-155142-ladsgroup.json [15:51:45] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:29] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [15:53:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:43] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:56:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:43] (CR) Andrew Bogott: [C: +2] wmcs-novastats-dnsleaks.py: make slightly better at handling codfw1dev (3 comments) [puppet] - https://gerrit.wikimedia.org/r/786307 (owner: Andrew Bogott) [15:58:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P27064 and previous config saved to /var/cache/conftool/dbconfig/20220429-155846-ladsgroup.json [15:58:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:21] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [15:59:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:47] (PS1) Andrew Bogott: wmcs-novastats/wmcs-novastats-dnsleaks.py: minor fix to .svc exclusion [puppet] - https://gerrit.wikimedia.org/r/787765 [16:02:16] (CR) Andrew Bogott: [C: +2] wmcs-novastats/wmcs-novastats-dnsleaks.py: minor fix to .svc exclusion [puppet] - https://gerrit.wikimedia.org/r/787765 (owner: Andrew Bogott) [16:02:38] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bullseye [16:02:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:48] SRE, ops-eqiad, DC-Ops, serviceops, GitLab (Infrastructure): Q3:(Need By: TBD) rack/setup/install gitlab100[3|4] and gitlab-runner100[2|3|4] - 
https://phabricator.wikimedia.org/T301177 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host gitlab... [16:03:31] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host gitlab-runner1003.eqiad.wmnet with OS bullseye [16:03:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:40] SRE, ops-eqiad, DC-Ops, serviceops, GitLab (Infrastructure): Q3:(Need By: TBD) rack/setup/install gitlab100[3|4] and gitlab-runner100[2|3|4] - https://phabricator.wikimedia.org/T301177 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host gitlab... [16:04:03] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:04:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T306560)', diff saved to https://phabricator.wikimedia.org/P27065 and previous config saved to /var/cache/conftool/dbconfig/20220429-160520-ladsgroup.json [16:05:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:26] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [16:06:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T298565)', diff saved to https://phabricator.wikimedia.org/P27066 and previous config saved to /var/cache/conftool/dbconfig/20220429-160602-ladsgroup.json [16:06:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:09] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [16:06:51] !log 
ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [16:06:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [16:06:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [16:06:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [16:07:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T298565)', diff saved to https://phabricator.wikimedia.org/P27067 and previous config saved to /var/cache/conftool/dbconfig/20220429-160702-ladsgroup.json [16:07:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:03] (PS1) Cmjohnson: add gitlab-runner1004 to site.pp [puppet] - https://gerrit.wikimedia.org/r/787787 (https://phabricator.wikimedia.org/T301177) [16:09:34] (PS2) Cmjohnson: add gitlab-runner1004 to site.pp [puppet] - https://gerrit.wikimedia.org/r/787787 (https://phabricator.wikimedia.org/T301177) [16:10:19] (CR) Cmjohnson: [C: +2] add gitlab-runner1004 to site.pp [puppet] - https://gerrit.wikimedia.org/r/787787 (https://phabricator.wikimedia.org/T301177) (owner: Cmjohnson) [16:12:28] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host gitlab-runner1004.eqiad.wmnet with OS bullseye [16:12:32] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:38] SRE, ops-eqiad, DC-Ops, serviceops, and 2 others: Q3:(Need By: TBD) rack/setup/install gitlab100[3|4] and gitlab-runner100[2|3|4] - https://phabricator.wikimedia.org/T301177 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host gitlab-runner1004.eqi... [16:13:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T298565)', diff saved to https://phabricator.wikimedia.org/P27068 and previous config saved to /var/cache/conftool/dbconfig/20220429-161323-ladsgroup.json [16:13:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:30] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [16:13:33] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage [16:13:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T298295)', diff saved to https://phabricator.wikimedia.org/P27069 and previous config saved to /var/cache/conftool/dbconfig/20220429-161352-ladsgroup.json [16:13:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [16:13:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [16:13:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:59] T298295: Fix length of columns page_restrictions.pr_level/pr_type on wmf wikis - 
https://phabricator.wikimedia.org/T298295 [16:14:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3316 (T298295)', diff saved to https://phabricator.wikimedia.org/P27070 and previous config saved to /var/cache/conftool/dbconfig/20220429-161400-ladsgroup.json [16:14:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:39] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1003.eqiad.wmnet with reason: host reimage [16:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:47] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host gitlab1003.wikimedia.org with OS bullseye [16:14:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:58] SRE, ops-eqiad, DC-Ops, serviceops, and 2 others: Q3:(Need By: TBD) rack/setup/install gitlab100[3|4] and gitlab-runner100[2|3|4] - https://phabricator.wikimedia.org/T301177 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host gitlab1003.wikimedia.... 
[16:16:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T298295)', diff saved to https://phabricator.wikimedia.org/P27071 and previous config saved to /var/cache/conftool/dbconfig/20220429-161610-ladsgroup.json [16:16:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:10] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage [16:17:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:28] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host gitlab1004.wikimedia.org with OS bullseye [16:18:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:39] SRE, ops-eqiad, DC-Ops, serviceops, and 2 others: Q3:(Need By: TBD) rack/setup/install gitlab100[3|4] and gitlab-runner100[2|3|4] - https://phabricator.wikimedia.org/T301177 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host gitlab1004.wikimedia.... 
[16:19:59] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1003.eqiad.wmnet with reason: host reimage [16:20:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P27072 and previous config saved to /var/cache/conftool/dbconfig/20220429-162025-ladsgroup.json [16:20:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:29] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1004.eqiad.wmnet with reason: host reimage [16:23:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:45] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab1003.wikimedia.org with reason: host reimage [16:25:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:56] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1004.eqiad.wmnet with reason: host reimage [16:27:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P27073 and previous config saved to /var/cache/conftool/dbconfig/20220429-162828-ladsgroup.json [16:28:30] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bullseye [16:28:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:41] SRE, ops-eqiad, DC-Ops, serviceops, and 2 others: Q3:(Need By: TBD) rack/setup/install gitlab100[3|4] and gitlab-runner100[2|3|4] - 
https://phabricator.wikimedia.org/T301177 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host gitlab-runner1002.eqiad.w... [16:29:08] !log jynus@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup1002.eqiad.wmnet with OS bullseye [16:29:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:25] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab1004.wikimedia.org with reason: host reimage [16:29:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:46] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab1003.wikimedia.org with reason: host reimage [16:29:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:58] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2042.codfw.wmnet with OS bullseye [16:30:02] SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host ms-be2042.codfw.wmnet with OS bullseye completed: - ms-be2042 (**PASS**) - Downtim... 
[16:30:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:12] !log jynus@cumin1001 START - Cookbook sre.hosts.reimage for host backup1002.eqiad.wmnet with OS buster [16:30:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P27074 and previous config saved to /var/cache/conftool/dbconfig/20220429-163115-ladsgroup.json [16:31:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:31] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1003.eqiad.wmnet with OS bullseye [16:31:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T306560)', diff saved to https://phabricator.wikimedia.org/P27075 and previous config saved to /var/cache/conftool/dbconfig/20220429-163135-ladsgroup.json [16:31:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:42] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [16:31:44] SRE, ops-eqiad, DC-Ops, serviceops, GitLab (Infrastructure): Q3:(Need By: TBD) rack/setup/install gitlab100[3|4] and gitlab-runner100[2|3|4] - https://phabricator.wikimedia.org/T301177 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host gitlab-run... 
[16:32:49] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab1004.wikimedia.org with reason: host reimage [16:32:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:36] SRE, ops-eqiad, DC-Ops, serviceops, GitLab (Infrastructure): Q3:(Need By: TBD) rack/setup/install gitlab100[3|4] and gitlab-runner100[2|3|4] - https://phabricator.wikimedia.org/T301177 (Dzahn) >>! In T301177#7891791, @Cmjohnson wrote: >>>! In T301177#7886110, @Dzahn wrote: >> confirming that... [16:35:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P27076 and previous config saved to /var/cache/conftool/dbconfig/20220429-163530-ladsgroup.json [16:35:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:38] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1004.eqiad.wmnet with OS bullseye [16:37:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:47] SRE, ops-eqiad, DC-Ops, serviceops, GitLab (Infrastructure): Q3:(Need By: TBD) rack/setup/install gitlab100[3|4] and gitlab-runner100[2|3|4] - https://phabricator.wikimedia.org/T301177 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host gitlab-run... [16:41:41] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab1003.wikimedia.org with OS bullseye [16:41:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:50] SRE, ops-eqiad, DC-Ops, serviceops, GitLab (Infrastructure): Q3:(Need By: TBD) rack/setup/install gitlab100[3|4] and gitlab-runner100[2|3|4] - https://phabricator.wikimedia.org/T301177 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host gitlab1003... 
[16:43:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P27077 and previous config saved to /var/cache/conftool/dbconfig/20220429-164333-ladsgroup.json [16:43:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:54] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab1004.wikimedia.org with OS bullseye [16:43:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:01] SRE, ops-eqiad, DC-Ops, serviceops, GitLab (Infrastructure): Q3:(Need By: TBD) rack/setup/install gitlab100[3|4] and gitlab-runner100[2|3|4] - https://phabricator.wikimedia.org/T301177 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host gitlab1004... [16:44:58] SRE, ops-eqiad, DC-Ops, serviceops, GitLab (Infrastructure): Q3:(Need By: TBD) rack/setup/install gitlab100[3|4] and gitlab-runner100[2|3|4] - https://phabricator.wikimedia.org/T301177 (Cmjohnson) [16:46:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P27078 and previous config saved to /var/cache/conftool/dbconfig/20220429-164620-ladsgroup.json [16:46:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:38] SRE, ops-eqiad, DC-Ops, serviceops, GitLab (Infrastructure): Q3:(Need By: TBD) rack/setup/install gitlab100[3|4] and gitlab-runner100[2|3|4] - https://phabricator.wikimedia.org/T301177 (Cmjohnson) Open→Resolved @Dzahn These have all been installed and resolving the task [16:46:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P27079 and previous config saved to /var/cache/conftool/dbconfig/20220429-164640-ladsgroup.json [16:46:45] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:40] (PS1) Vivian Rook: adding rook removing mdipietro [labs/private] - https://gerrit.wikimedia.org/r/787789 [16:49:01] (CR) Andrew Bogott: [V: +2 C: +2] adding rook removing mdipietro [labs/private] - https://gerrit.wikimedia.org/r/787789 (owner: Vivian Rook) [16:50:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T306560)', diff saved to https://phabricator.wikimedia.org/P27080 and previous config saved to /var/cache/conftool/dbconfig/20220429-165035-ladsgroup.json [16:50:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2110.codfw.wmnet with reason: Maintenance [16:50:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2110.codfw.wmnet with reason: Maintenance [16:50:42] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [16:50:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 12 hosts with reason: Maintenance [16:50:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 12 hosts with reason: Maintenance [16:50:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [16:53:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [16:53:25] !log 
ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [16:53:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [16:53:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T276292)', diff saved to https://phabricator.wikimedia.org/P27081 and previous config saved to /var/cache/conftool/dbconfig/20220429-165333-ladsgroup.json [16:53:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:48] T276292: Schema change for renaming new_name_timestamp to rc_new_name_timestamp in recentchanges - https://phabricator.wikimedia.org/T276292 [16:56:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T276292)', diff saved to https://phabricator.wikimedia.org/P27082 and previous config saved to /var/cache/conftool/dbconfig/20220429-165613-ladsgroup.json [16:56:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:31] (CR) Andrew Bogott: [C: +2] hieradata: swap remaining ldap-labs names to ldap-rw [puppet] - https://gerrit.wikimedia.org/r/786265 (https://phabricator.wikimedia.org/T295150) (owner: Majavah) [16:58:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T298565)', diff saved to https://phabricator.wikimedia.org/P27083 and previous config saved to /var/cache/conftool/dbconfig/20220429-165839-ladsgroup.json [16:58:45] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:46] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [16:59:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [16:59:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [16:59:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [16:59:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [16:59:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T298565)', diff saved to https://phabricator.wikimedia.org/P27084 and previous config saved to /var/cache/conftool/dbconfig/20220429-165939-ladsgroup.json [16:59:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T298295)', diff saved to https://phabricator.wikimedia.org/P27085 and previous config saved to /var/cache/conftool/dbconfig/20220429-170125-ladsgroup.json [17:01:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 
on dbstore1005.eqiad.wmnet with reason: Maintenance [17:01:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [17:01:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:33] T298295: Fix length of columns page_restrictions.pr_level/pr_type on wmf wikis - https://phabricator.wikimedia.org/T298295 [17:01:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance [17:01:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance [17:01:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [17:01:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [17:01:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P27086 and previous config saved to /var/cache/conftool/dbconfig/20220429-170145-ladsgroup.json [17:01:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [17:01:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1121.eqiad.wmnet with reason: Maintenance [17:01:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [17:01:56] !log jynus@cumin1001 
END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup1002.eqiad.wmnet with OS buster [17:01:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1121.eqiad.wmnet with reason: Maintenance [17:01:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [17:02:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [17:02:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [17:02:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [17:02:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1121 (T306560)', diff saved to https://phabricator.wikimedia.org/P27087 and previous config saved to /var/cache/conftool/dbconfig/20220429-170205-ladsgroup.json [17:02:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:44] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:55] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [17:05:23] (PS1) Andrew Bogott: codfw1dev: standardize ldap servers [puppet] - https://gerrit.wikimedia.org/r/787790 [17:05:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T298565)', diff saved to https://phabricator.wikimedia.org/P27088 and previous config saved to /var/cache/conftool/dbconfig/20220429-170559-ladsgroup.json [17:06:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:06] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [17:07:18] (CR) Andrew Bogott: [C: +2] codfw1dev: standardize ldap servers [puppet] - https://gerrit.wikimedia.org/r/787790 (owner: Andrew Bogott) [17:08:20] SRE, Data-Engineering, Data-Engineering-Kanban, Traffic: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (BTullis) Well, this is a bit confusing. I've examined packet captures from two pods in eqiad and another in codfw.... 
[17:11:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P27089 and previous config saved to /var/cache/conftool/dbconfig/20220429-171118-ladsgroup.json [17:11:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T298295)', diff saved to https://phabricator.wikimedia.org/P27090 and previous config saved to /var/cache/conftool/dbconfig/20220429-171318-ladsgroup.json [17:13:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:25] T298295: Fix length of columns page_restrictions.pr_level/pr_type on wmf wikis - https://phabricator.wikimedia.org/T298295 [17:13:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T306560)', diff saved to https://phabricator.wikimedia.org/P27091 and previous config saved to /var/cache/conftool/dbconfig/20220429-171339-ladsgroup.json [17:13:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:47] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [17:16:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T306560)', diff saved to https://phabricator.wikimedia.org/P27092 and previous config saved to /var/cache/conftool/dbconfig/20220429-171650-ladsgroup.json [17:16:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance [17:16:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance [17:16:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1157 (T306560)', diff saved to 
https://phabricator.wikimedia.org/P27093 and previous config saved to /var/cache/conftool/dbconfig/20220429-171658-ladsgroup.json [17:17:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P27094 and previous config saved to /var/cache/conftool/dbconfig/20220429-172104-ladsgroup.json [17:21:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:25] (03PS1) 10Andrew Bogott: wmfkeystonehooks: modest improvement to exception handling [puppet] - 10https://gerrit.wikimedia.org/r/787791 [17:23:11] PROBLEM - Check systemd state on ms-be1054 is CRITICAL: CRITICAL - degraded: The following units failed: session-326117.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:23:14] (03CR) 10Andrew Bogott: [C: 03+2] wmfkeystonehooks: modest improvement to exception handling [puppet] - 10https://gerrit.wikimedia.org/r/787791 (owner: 10Andrew Bogott) [17:25:05] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:26:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P27095 and previous config saved to /var/cache/conftool/dbconfig/20220429-172623-ladsgroup.json [17:26:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P27096 and previous config saved to /var/cache/conftool/dbconfig/20220429-172823-ladsgroup.json 
[17:28:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P27097 and previous config saved to /var/cache/conftool/dbconfig/20220429-172845-ladsgroup.json [17:28:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:58] (PS1) Tchanders: Add QuickSurveys survey for the SimilarEditors feature [mediawiki-config] - https://gerrit.wikimedia.org/r/787793 (https://phabricator.wikimedia.org/T307025) [17:31:54] (CR) Tchanders: "This can be tested locally by adding the config to LocalSettings.php and pulling down I0b508faf3445c6e7caffc964a5aa67231a01da9b" [mediawiki-config] - https://gerrit.wikimedia.org/r/787793 (https://phabricator.wikimedia.org/T307025) (owner: Tchanders) [17:35:46] SRE, Data-Engineering, Data-Engineering-Kanban, Traffic: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (BTullis) I have a few errors logged by ats-be attempting to connect to `eventgate-analytics-external.discovery.wmne... 
[17:36:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P27098 and previous config saved to /var/cache/conftool/dbconfig/20220429-173609-ladsgroup.json [17:36:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:48] !log killed bnwiki's refresh links recommendation (T299021) [17:36:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:54] T299021: Shorten running time of refreshLinkRecommendations.php - https://phabricator.wikimedia.org/T299021 [17:39:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:39:25] (PS1) Andrew Bogott: wmfkeystonehooks: raise loglevel for ldap failures [puppet] - https://gerrit.wikimedia.org/r/787794 [17:39:43] (CR) Ottomata: Image Suggestions Feedback Stream (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/787749 (owner: Luke Bowmaker) [17:40:03] (CR) jerkins-bot: [V: -1] wmfkeystonehooks: raise loglevel for ldap failures [puppet] - https://gerrit.wikimedia.org/r/787794 (owner: Andrew Bogott) [17:41:22] (PS2) Andrew Bogott: wmfkeystonehooks: raise loglevel for ldap failures [puppet] - https://gerrit.wikimedia.org/r/787794 [17:41:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T276292)', diff saved to https://phabricator.wikimedia.org/P27099 and previous config saved to /var/cache/conftool/dbconfig/20220429-174129-ladsgroup.json [17:41:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:36] T276292: Schema change for renaming new_name_timestamp to rc_new_name_timestamp in recentchanges - 
https://phabricator.wikimedia.org/T276292 [17:41:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T306560)', diff saved to https://phabricator.wikimedia.org/P27100 and previous config saved to /var/cache/conftool/dbconfig/20220429-174136-ladsgroup.json [17:41:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:43] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [17:43:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P27101 and previous config saved to /var/cache/conftool/dbconfig/20220429-174328-ladsgroup.json [17:43:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P27102 and previous config saved to /var/cache/conftool/dbconfig/20220429-174350-ladsgroup.json [17:43:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:50:03] (03CR) 10Andrew Bogott: [C: 03+2] wmfkeystonehooks: raise loglevel for ldap failures [puppet] - 10https://gerrit.wikimedia.org/r/787794 (owner: 10Andrew Bogott) [17:51:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T298565)', diff saved to https://phabricator.wikimedia.org/P27103 and previous config saved to /var/cache/conftool/dbconfig/20220429-175114-ladsgroup.json [17:51:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on 
db1113.eqiad.wmnet with reason: Maintenance [17:51:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [17:51:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:22] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [17:51:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T298565)', diff saved to https://phabricator.wikimedia.org/P27104 and previous config saved to /var/cache/conftool/dbconfig/20220429-175122-ladsgroup.json [17:51:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:12] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban, 10Traffic: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (10Ottomata) > perhaps this is a client browser opening a connection but sending an empty POST body This seems likely,... 
[17:56:12] (03PS6) 10Luke Bowmaker: Image Suggestions Feedback Stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787749 [17:56:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P27105 and previous config saved to /var/cache/conftool/dbconfig/20220429-175642-ladsgroup.json [17:56:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T298565)', diff saved to https://phabricator.wikimedia.org/P27106 and previous config saved to /var/cache/conftool/dbconfig/20220429-175757-ladsgroup.json [17:58:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:04] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [17:58:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T298295)', diff saved to https://phabricator.wikimedia.org/P27107 and previous config saved to /var/cache/conftool/dbconfig/20220429-175833-ladsgroup.json [17:58:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [17:58:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [17:58:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:41] T298295: Fix length of columns page_restrictions.pr_level/pr_type on wmf wikis - https://phabricator.wikimedia.org/T298295 [17:58:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T298295)', diff saved to 
https://phabricator.wikimedia.org/P27108 and previous config saved to /var/cache/conftool/dbconfig/20220429-175841-ladsgroup.json [17:58:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T306560)', diff saved to https://phabricator.wikimedia.org/P27109 and previous config saved to /var/cache/conftool/dbconfig/20220429-175855-ladsgroup.json [17:58:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1141.eqiad.wmnet with reason: Maintenance [17:58:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1141.eqiad.wmnet with reason: Maintenance [17:59:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:02] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [17:59:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1141 (T306560)', diff saved to https://phabricator.wikimedia.org/P27110 and previous config saved to /var/cache/conftool/dbconfig/20220429-175903-ladsgroup.json [17:59:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T298295)', diff saved to https://phabricator.wikimedia.org/P27111 and previous config saved to /var/cache/conftool/dbconfig/20220429-175951-ladsgroup.json [17:59:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:07] PROBLEM - SSH on pki2001.mgmt is 
CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:09:17] RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:11:30] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1040.eqiad.wmnet [18:11:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T306560)', diff saved to https://phabricator.wikimedia.org/P27112 and previous config saved to /var/cache/conftool/dbconfig/20220429-181145-ladsgroup.json [18:11:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:52] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [18:11:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P27113 and previous config saved to /var/cache/conftool/dbconfig/20220429-181153-ladsgroup.json [18:11:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P27114 and previous config saved to /var/cache/conftool/dbconfig/20220429-181302-ladsgroup.json [18:13:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P27115 and previous config saved to /var/cache/conftool/dbconfig/20220429-181456-ladsgroup.json [18:15:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:23] RECOVERY - Disk space on ms-be1040 is OK: DISK 
OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be1040&var-datasource=eqiad+prometheus/ops [18:21:37] !log mvernon@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ms-be1040.eqiad.wmnet [18:21:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:59] PROBLEM - Check systemd state on ms-be1040 is CRITICAL: CRITICAL - degraded: The following units failed: srv-swift\x2dstorage-sdc1.mount,srv-swift\x2dstorage-sdf1.mount,srv-swift\x2dstorage-sdh1.mount,srv-swift\x2dstorage-sdk1.mount,srv-swift\x2dstorage-sdm1.mount https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:25:55] ACKNOWLEDGEMENT - Check systemd state on ms-be1040 is CRITICAL: CRITICAL - degraded: The following units failed: srv-swift\x2dstorage-sdc1.mount,srv-swift\x2dstorage-sdf1.mount,srv-swift\x2dstorage-sdh1.mount,srv-swift\x2dstorage-sdk1.mount,srv-swift\x2dstorage-sdm1.mount MVernon filesystems are sad system is attempting repair. - The acknowledgement expires at: 2022-05-03 10:25:05. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_syst [18:25:55] e [18:26:15] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:26:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P27116 and previous config saved to /var/cache/conftool/dbconfig/20220429-182653-ladsgroup.json [18:27:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T306560)', diff saved to https://phabricator.wikimedia.org/P27117 and previous config saved to /var/cache/conftool/dbconfig/20220429-182700-ladsgroup.json [18:27:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance [18:27:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance [18:27:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [18:27:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:08] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [18:27:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [18:27:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T306560)', diff saved to https://phabricator.wikimedia.org/P27118 and previous config saved to /var/cache/conftool/dbconfig/20220429-182714-ladsgroup.json [18:27:17] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P27119 and previous config saved to /var/cache/conftool/dbconfig/20220429-182807-ladsgroup.json [18:28:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P27120 and previous config saved to /var/cache/conftool/dbconfig/20220429-183001-ladsgroup.json [18:30:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:52] (03PS5) 10Bking: Elastic: test puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) [18:31:53] (03CR) 10jerkins-bot: [V: 04-1] Elastic: test puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [18:32:57] PROBLEM - SSH on analytics1061.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:37:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance [18:37:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance [18:37:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:55] (NodeTextfileStale) firing: (3) 
Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:38:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [18:38:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [18:38:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance [18:39:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance [18:39:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P27121 and previous config saved to /var/cache/conftool/dbconfig/20220429-184200-ladsgroup.json [18:42:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T298565)', diff saved to https://phabricator.wikimedia.org/P27122 and previous config saved to /var/cache/conftool/dbconfig/20220429-184313-ladsgroup.json [18:43:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:20] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, 
user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [18:44:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [18:44:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [18:44:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [18:44:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [18:44:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T298565)', diff saved to https://phabricator.wikimedia.org/P27123 and previous config saved to /var/cache/conftool/dbconfig/20220429-184411-ladsgroup.json [18:44:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T298295)', diff saved to https://phabricator.wikimedia.org/P27124 and previous config saved to /var/cache/conftool/dbconfig/20220429-184506-ladsgroup.json [18:45:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:13] T298295: Fix length of columns page_restrictions.pr_level/pr_type on wmf wikis - https://phabricator.wikimedia.org/T298295 [18:48:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 
6:00:00 on db1173.eqiad.wmnet with reason: Maintenance [18:48:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1173.eqiad.wmnet with reason: Maintenance [18:48:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T298565)', diff saved to https://phabricator.wikimedia.org/P27125 and previous config saved to /var/cache/conftool/dbconfig/20220429-185034-ladsgroup.json [18:50:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:41] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [18:50:50] (03PS6) 10Bking: Elastic: test puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) [18:51:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T306560)', diff saved to https://phabricator.wikimedia.org/P27126 and previous config saved to /var/cache/conftool/dbconfig/20220429-185109-ladsgroup.json [18:51:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:16] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [18:51:26] (03CR) 10jerkins-bot: [V: 04-1] Elastic: test puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [18:56:22] (03PS7) 10Bking: Elastic: test puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) [18:57:05] !log ladsgroup@cumin1001 dbctl 
commit (dc=all): 'Repooling after maintenance db1141 (T306560)', diff saved to https://phabricator.wikimedia.org/P27127 and previous config saved to /var/cache/conftool/dbconfig/20220429-185705-ladsgroup.json [18:57:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [18:57:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [18:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:13] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [18:57:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:36] (03CR) 10jerkins-bot: [V: 04-1] Elastic: test puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [19:00:44] (03PS8) 10Bking: Elastic: test puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) [19:01:44] (03CR) 10jerkins-bot: [V: 04-1] Elastic: test puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [19:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:03:15] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:05:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P27128 
and previous config saved to /var/cache/conftool/dbconfig/20220429-190539-ladsgroup.json [19:05:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P27129 and previous config saved to /var/cache/conftool/dbconfig/20220429-190614-ladsgroup.json [19:06:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:12] 10SRE, 10DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10RobH) [19:08:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [19:08:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [19:08:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:04] (03PS9) 10Bking: Elastic: test puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) [19:18:30] (03CR) 10jerkins-bot: [V: 04-1] Elastic: test puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [19:19:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1142.eqiad.wmnet with reason: Maintenance [19:19:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1142.eqiad.wmnet with reason: Maintenance [19:19:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1142 (T306560)', diff saved to https://phabricator.wikimedia.org/P27130 and previous config saved to 
/var/cache/conftool/dbconfig/20220429-191932-ladsgroup.json [19:19:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:44] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [19:19:50] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [19:20:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P27131 and previous config saved to /var/cache/conftool/dbconfig/20220429-192044-ladsgroup.json [19:20:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P27132 and previous config saved to /var/cache/conftool/dbconfig/20220429-192119-ladsgroup.json [19:21:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:45] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [19:25:54] (03PS1) 10Papaul: Add new aqs node to site.pp and to netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/787812 (https://phabricator.wikimedia.org/T305568) [19:26:35] (03PS10) 10Bking: Elastic: test puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) [19:27:11] (03CR) 10jerkins-bot: [V: 04-1] Elastic: test puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [19:29:35] (03CR) 10Papaul: [C: 03+2] Add new aqs node to site.pp and to netboot.cfg 
[puppet] - 10https://gerrit.wikimedia.org/r/787812 (https://phabricator.wikimedia.org/T305568) (owner: 10Papaul) [19:29:47] (03PS2) 10Papaul: Add new aqs node to site.pp and to netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/787812 (https://phabricator.wikimedia.org/T305568) [19:31:21] (03PS11) 10Bking: Elastic: test puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) [19:32:03] (03CR) 10jerkins-bot: [V: 04-1] Elastic: test puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [19:32:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T306560)', diff saved to https://phabricator.wikimedia.org/P27133 and previous config saved to /var/cache/conftool/dbconfig/20220429-193230-ladsgroup.json [19:32:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:38] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [19:33:32] (03PS12) 10Bking: Elastic: test puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) [19:33:34] (03PS1) 10Cwhite: opensearch: enable curator version override [puppet] - 10https://gerrit.wikimedia.org/r/787816 (https://phabricator.wikimedia.org/T301017) [19:34:07] (03CR) 10jerkins-bot: [V: 04-1] Elastic: test puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [19:34:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudcontrol100[6-7].wikimedia.org - https://phabricator.wikimedia.org/T306853 (10Andrew) > I'm also interested in knowing how they work, will both servers be a redundant pair? if so how does failover work... 
[19:35:48] (03CR) 10Cwhite: [C: 03+2] "PCC NOOP https://puppet-compiler.wmflabs.org/pcc-worker1003/35011/" [puppet] - 10https://gerrit.wikimedia.org/r/787816 (https://phabricator.wikimedia.org/T301017) (owner: 10Cwhite) [19:35:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T298565)', diff saved to https://phabricator.wikimedia.org/P27134 and previous config saved to /var/cache/conftool/dbconfig/20220429-193549-ladsgroup.json [19:35:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:57] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [19:36:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T306560)', diff saved to https://phabricator.wikimedia.org/P27135 and previous config saved to /var/cache/conftool/dbconfig/20220429-193624-ladsgroup.json [19:36:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [19:36:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [19:36:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:32] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host aqs2001.codfw.wmnet with OS bullseye [19:36:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [19:36:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [19:36:41] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops, 10Patch-For-Review: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host aqs2001.codfw.wmnet with OS bullseye [19:36:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [19:36:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [19:36:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T298565)', diff saved to https://phabricator.wikimedia.org/P27136 and previous config saved to /var/cache/conftool/dbconfig/20220429-193649-ladsgroup.json [19:36:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:23] (03PS13) 10Bking: Elastic: test puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) [19:37:57] (03CR) 10jerkins-bot: [V: 04-1] Elastic: test puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [19:38:46] 10SRE, 10serviceops: Service Ops SRE support for iOS notifications update - https://phabricator.wikimedia.org/T306397 (10Dzahn) [19:39:15] 10SRE, 10serviceops: Service Ops SRE support for 
iOS notifications update - https://phabricator.wikimedia.org/T306397 (10Dzahn) [19:41:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [19:41:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [19:41:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T306560)', diff saved to https://phabricator.wikimedia.org/P27137 and previous config saved to /var/cache/conftool/dbconfig/20220429-194122-ladsgroup.json [19:41:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:32] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [19:43:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T298565)', diff saved to https://phabricator.wikimedia.org/P27138 and previous config saved to /var/cache/conftool/dbconfig/20220429-194308-ladsgroup.json [19:43:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:16] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [19:44:23] 10SRE, 10ops-codfw, 10DC-Ops, 10GitLab (Infrastructure): Q3:(Need By: TBD) rack/setup/install gitlab200[2|3] and gitlab-runner200[2|3|4] - https://phabricator.wikimedia.org/T301183 (10Dzahn) [19:44:31] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10GitLab (Infrastructure): Q3:(Need By: TBD) rack/setup/install gitlab100[3|4] and gitlab-runner100[2|3|4] - 
https://phabricator.wikimedia.org/T301177 (10Dzahn) [19:44:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10GitLab (Infrastructure): Q3:(Need By: TBD) rack/setup/install gitlab100[3|4] and gitlab-runner100[2|3|4] - https://phabricator.wikimedia.org/T301177 (10Dzahn) Thank you @Cmjohnson We continue this on T307142 [19:47:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P27139 and previous config saved to /var/cache/conftool/dbconfig/20220429-194735-ladsgroup.json [19:47:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:27] 10SRE, 10ops-codfw, 10DC-Ops, 10GitLab (Infrastructure): Q3:(Need By: TBD) rack/setup/install gitlab200[2|3] and gitlab-runner200[2|3|4] - https://phabricator.wikimedia.org/T301183 (10Dzahn) Thank you @Papaul. We continue on T307142 [19:50:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:52:03] (03PS1) 10Dzahn: add gitlab-runner role on new physical server gitlab-runner2002 [puppet] - 10https://gerrit.wikimedia.org/r/787820 (https://phabricator.wikimedia.org/T307142) [19:56:22] 10SRE, 10serviceops: Service Ops SRE support for iOS notifications update - https://phabricator.wikimedia.org/T306397 (10Dmantena) [19:58:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P27140 and previous config saved to /var/cache/conftool/dbconfig/20220429-195813-ladsgroup.json [19:58:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P27141 
and previous config saved to /var/cache/conftool/dbconfig/20220429-200240-ladsgroup.json [20:02:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:03] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host aqs2001.codfw.wmnet with OS bullseye [20:08:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:11] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops, 10Patch-For-Review: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host aqs2001.codfw.wmnet with OS bullseye executed wi... [20:11:18] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host aqs2001.codfw.wmnet with OS bullseye [20:11:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:35] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops, 10Patch-For-Review: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host aqs2001.codfw.wmnet with OS bullseye [20:11:35] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:11:57] (03PS1) 10Cwhite: opensearch: set USE_OPENSEARCH curator env variable [puppet] - 10https://gerrit.wikimedia.org/r/787824 (https://phabricator.wikimedia.org/T301017) [20:13:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P27142 and previous config saved to /var/cache/conftool/dbconfig/20220429-201319-ladsgroup.json [20:13:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:30] (03PS14) 10Bking: Elastic: test puppet logic [puppet] - 
10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) [20:15:28] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host aqs2002.codfw.wmnet with OS bullseye [20:15:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:35] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops, 10Patch-For-Review: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host aqs2002.codfw.wmnet with OS bullseye [20:15:39] (03CR) 10jerkins-bot: [V: 04-1] Elastic: test puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [20:16:14] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs2001.codfw.wmnet with reason: host reimage [20:16:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:37] (03PS1) 10Cwhite: beta-logs: disable compatibility mode [puppet] - 10https://gerrit.wikimedia.org/r/787826 (https://phabricator.wikimedia.org/T301017) [20:17:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T306560)', diff saved to https://phabricator.wikimedia.org/P27143 and previous config saved to /var/cache/conftool/dbconfig/20220429-201745-ladsgroup.json [20:17:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1143.eqiad.wmnet with reason: Maintenance [20:17:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1143.eqiad.wmnet with reason: Maintenance [20:17:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:52] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [20:17:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1143 (T306560)', diff saved to 
https://phabricator.wikimedia.org/P27144 and previous config saved to /var/cache/conftool/dbconfig/20220429-201753-ladsgroup.json [20:17:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:39] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs2001.codfw.wmnet with reason: host reimage [20:19:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:55] 10SRE, 10ops-drmrs, 10ops-esams, 10Infrastructure-Foundations, 10netops: drmrs-esams wave provisioning - https://phabricator.wikimedia.org/T307221 (10wiki_willy) @RobH - here are the LOAs in pdf format below: {F35074530} {F35074529} Thanks, Willy [20:28:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T298565)', diff saved to https://phabricator.wikimedia.org/P27145 and previous config saved to /var/cache/conftool/dbconfig/20220429-202824-ladsgroup.json [20:28:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:32] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [20:30:01] (03PS1) 10Jforrester: [Beta Cluster] Set special footer licence message for Wikifunctions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787828 (https://phabricator.wikimedia.org/T297330) [20:30:03] (03PS1) 10Jforrester: Set special footer licence message for MediaWiki.org re. 
Help: pages [mediawiki-config] - https://gerrit.wikimedia.org/r/787829 (https://phabricator.wikimedia.org/T301483) [20:30:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T306560)', diff saved to https://phabricator.wikimedia.org/P27146 and previous config saved to /var/cache/conftool/dbconfig/20220429-203045-ladsgroup.json [20:30:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:52] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [20:31:17] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs2001.codfw.wmnet with OS bullseye [20:31:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:23] SRE, ops-codfw, Cassandra, DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host aqs2001.codfw.wmnet with OS bullseye completed: - aqs2001 (**PASS**)... [20:31:39] SRE, serviceops: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (Dzahn) [20:34:34] SRE, serviceops: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (Dzahn) Open→Resolved >>! In T290192#7886070, @Papaul wrote: > @Dzahn i think it is best to create another task for this issue and not reopen the rack/setup task. Thanks repla...
[20:34:41] RECOVERY - SSH on analytics1061.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:34:42] SRE, serviceops: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (Dzahn) a:Dzahn→Papaul [20:35:28] (PS15) Bking: Elastic: test puppet logic [puppet] - https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) [20:36:25] (CR) jerkins-bot: [V: -1] Elastic: test puppet logic [puppet] - https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) (owner: Bking) [20:41:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T306560)', diff saved to https://phabricator.wikimedia.org/P27147 and previous config saved to /var/cache/conftool/dbconfig/20220429-204136-ladsgroup.json [20:41:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:44] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [20:44:13] (PS16) Bking: Elastic: test puppet logic [puppet] - https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) [20:44:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:45:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P27148 and previous config saved to /var/cache/conftool/dbconfig/20220429-204550-ladsgroup.json [20:45:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:23] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage 
(exit_code=99) for host aqs2002.codfw.wmnet with OS bullseye [20:46:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:28] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host aqs2002.codfw.wmnet with OS bullseye executed with errors: - aqs2002 (... [20:46:32] (03CR) 10jerkins-bot: [V: 04-1] Elastic: test puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [20:49:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:52:14] (03PS17) 10Bking: Elastic: test puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) [20:54:19] (03CR) 10Andrew Bogott: [C: 03+1] "lgtm! I'd like to be around when you merge though." 
[puppet] - 10https://gerrit.wikimedia.org/r/779936 (https://phabricator.wikimedia.org/T305589) (owner: 10Ssingh) [20:54:21] (03CR) 10jerkins-bot: [V: 04-1] Elastic: test puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [20:56:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P27149 and previous config saved to /var/cache/conftool/dbconfig/20220429-205641-ladsgroup.json [20:56:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P27150 and previous config saved to /var/cache/conftool/dbconfig/20220429-210055-ladsgroup.json [21:01:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:25] (03CR) 10Andrew Bogott: [C: 03+2] P:openstack::nova::compute::service: replace query_nodes with role_hosts [puppet] - 10https://gerrit.wikimedia.org/r/787484 (owner: 10Jbond) [21:02:26] (03CR) 10Andrew Bogott: [C: 03+2] P:openstack::haproxy: enable built-in prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/786783 (owner: 10Majavah) [21:04:29] (03PS1) 10PipelineBot: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/787831 [21:09:43] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:11:33] (03CR) 10Andrew Bogott: [C: 03+1] "LGTM but let's not merge on a Friday" [puppet] - 10https://gerrit.wikimedia.org/r/785110 (https://phabricator.wikimedia.org/T274666) (owner: 10Majavah) [21:11:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P27151 and previous config saved to 
/var/cache/conftool/dbconfig/20220429-211146-ladsgroup.json [21:11:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:17] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:13:40] (03CR) 10Andrew Bogott: [C: 03+2] P:openstack::rabbitmq: cleanup [puppet] - 10https://gerrit.wikimedia.org/r/787003 (owner: 10Majavah) [21:14:32] (03PS18) 10Bking: Elastic: test puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) [21:16:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T306560)', diff saved to https://phabricator.wikimedia.org/P27152 and previous config saved to /var/cache/conftool/dbconfig/20220429-211601-ladsgroup.json [21:16:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1148.eqiad.wmnet with reason: Maintenance [21:16:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1148.eqiad.wmnet with reason: Maintenance [21:16:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:08] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [21:16:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1148 (T306560)', diff saved to https://phabricator.wikimedia.org/P27153 and previous config saved to /var/cache/conftool/dbconfig/20220429-211609-ladsgroup.json [21:16:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:20] (03CR) 10jerkins-bot: [V: 04-1] Elastic: test puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [21:16:24] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:29] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host aqs2002.codfw.wmnet with OS bullseye [21:21:33] (03PS19) 10Bking: Elastic: test puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) [21:21:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:36] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host aqs2002.codfw.wmnet with OS bullseye [21:23:21] (03CR) 10jerkins-bot: [V: 04-1] Elastic: test puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [21:26:26] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs2002.codfw.wmnet with reason: host reimage [21:26:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T306560)', diff saved to https://phabricator.wikimedia.org/P27154 and previous config saved to /var/cache/conftool/dbconfig/20220429-212652-ladsgroup.json [21:26:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [21:26:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [21:26:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 6 hosts with reason: Maintenance [21:26:58] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [21:27:03] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 6 hosts with reason: Maintenance [21:27:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T306560)', diff saved to https://phabricator.wikimedia.org/P27155 and previous config saved to /var/cache/conftool/dbconfig/20220429-212808-ladsgroup.json [21:28:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:53] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs2002.codfw.wmnet with reason: host reimage [21:29:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:41] (03PS20) 10Bking: Elastic: test puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) [21:36:29] (03CR) 10jerkins-bot: [V: 04-1] Elastic: test puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [21:41:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [21:41:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [21:41:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:13] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs2002.codfw.wmnet with OS bullseye [21:42:18] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:19] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host aqs2002.codfw.wmnet with OS bullseye completed: - aqs2002 (**PASS**)... [21:43:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P27156 and previous config saved to /var/cache/conftool/dbconfig/20220429-214313-ladsgroup.json [21:43:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:37] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host aqs2003.codfw.wmnet with OS bullseye [21:54:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:42] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host aqs2003.codfw.wmnet with OS bullseye [21:58:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P27157 and previous config saved to /var/cache/conftool/dbconfig/20220429-215818-ladsgroup.json [21:58:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:44] (03CR) 10Cwhite: [C: 03+2] logstash: transform rotation frequency values to datestamp format [puppet] - 10https://gerrit.wikimedia.org/r/777882 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite) [22:02:07] (03PS5) 10Cwhite: logstash: transform rotation frequency values to datestamp format [puppet] - 10https://gerrit.wikimedia.org/r/777882 (https://phabricator.wikimedia.org/T305175) [22:13:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db1148 (T306560)', diff saved to https://phabricator.wikimedia.org/P27158 and previous config saved to /var/cache/conftool/dbconfig/20220429-221323-ladsgroup.json [22:13:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1149.eqiad.wmnet with reason: Maintenance [22:13:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1149.eqiad.wmnet with reason: Maintenance [22:13:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:31] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [22:13:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1149 (T306560)', diff saved to https://phabricator.wikimedia.org/P27159 and previous config saved to /var/cache/conftool/dbconfig/20220429-221331-ladsgroup.json [22:13:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:26] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host aqs2004.codfw.wmnet with OS bullseye [22:15:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:31] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host aqs2004.codfw.wmnet with OS bullseye [22:15:38] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host aqs2004.codfw.wmnet with OS bullseye [22:15:42] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by 
pt1979@cumin2002 for host aqs2004.codfw.wmnet with OS bullseye executed with errors: - aqs2004 (... [22:15:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:56] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host aqs2004.codfw.wmnet with OS bullseye [22:17:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:00] SRE, ops-codfw, Cassandra, DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host aqs2004.codfw.wmnet with OS bullseye [22:25:43] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host aqs2003.codfw.wmnet with OS bullseye [22:25:48] SRE, ops-codfw, Cassandra, DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host aqs2003.codfw.wmnet with OS bullseye executed with errors: - aqs2003 (...
[22:25:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T306560)', diff saved to https://phabricator.wikimedia.org/P27160 and previous config saved to /var/cache/conftool/dbconfig/20220429-222620-ladsgroup.json [22:26:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:27] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [22:28:48] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host aqs2003.codfw.wmnet with OS bullseye [22:28:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:28:54] SRE, ops-codfw, Cassandra, DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host aqs2003.codfw.wmnet with OS bullseye [22:33:44] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs2003.codfw.wmnet with reason: host reimage [22:33:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:12] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs2003.codfw.wmnet with reason: host reimage [22:37:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:41:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P27161 and previous config saved to 
/var/cache/conftool/dbconfig/20220429-224125-ladsgroup.json [22:41:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:48:25] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host aqs2004.codfw.wmnet with OS bullseye [22:48:29] SRE, ops-codfw, Cassandra, DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host aqs2004.codfw.wmnet with OS bullseye executed with errors: - aqs2004 (... [22:48:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:49:26] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs2003.codfw.wmnet with OS bullseye [22:49:30] SRE, ops-codfw, Cassandra, DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host aqs2003.codfw.wmnet with OS bullseye completed: - aqs2003 (**PASS**)... 
[22:49:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P27162 and previous config saved to /var/cache/conftool/dbconfig/20220429-225631-ladsgroup.json [22:56:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [23:11:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T306560)', diff saved to https://phabricator.wikimedia.org/P27163 and previous config saved to /var/cache/conftool/dbconfig/20220429-231136-ladsgroup.json [23:11:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:43] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [23:22:45] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:38:40] PROBLEM - MariaDB Replica IO: s2 on db2101 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2104.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:38:50] PROBLEM - MariaDB Replica IO: s5 on db2101 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2123.codfw.wmnet:3306 - retry-time: 60 
maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:39:04] PROBLEM - MariaDB Replica IO: x1 on db2101 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2096.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:40:52] RECOVERY - MariaDB Replica IO: s2 on db2101 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:41:02] RECOVERY - MariaDB Replica IO: s5 on db2101 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:41:18] RECOVERY - MariaDB Replica IO: x1 on db2101 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:42:12] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:46:29] (CR) Cwhite: [C: +2] beta-logs: disable compatibility mode [puppet] - https://gerrit.wikimedia.org/r/787826 (https://phabricator.wikimedia.org/T301017) (owner: Cwhite) [23:50:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [23:51:13] (CR) BryanDavis: [C: +1] Enable $wgFixDoubleRedirects on officewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/780636 (https://phabricator.wikimedia.org/T305782) (owner: 
MarcoAurelio)