[00:01:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1113.eqiad.wmnet with reason: Maintenance [00:01:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1113.eqiad.wmnet with reason: Maintenance [00:01:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:01:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:01:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T298563)', diff saved to https://phabricator.wikimedia.org/P27196 and previous config saved to /var/cache/conftool/dbconfig/20220502-000151-ladsgroup.json [00:01:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:01:55] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [00:03:15] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&viewPanel=8 [00:14:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance [00:14:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance [00:14:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:14:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:14:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1157 (T298565)', diff saved to https://phabricator.wikimedia.org/P27197 and previous config saved to /var/cache/conftool/dbconfig/20220502-001449-ladsgroup.json [00:14:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:14:54] T298565: Fix mismatching field type of user table 
for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [00:23:39] RECOVERY - SSH on wtp1037.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:30:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T298565)', diff saved to https://phabricator.wikimedia.org/P27198 and previous config saved to /var/cache/conftool/dbconfig/20220502-003052-ladsgroup.json [00:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:30:57] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [00:42:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [00:42:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [00:42:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:42:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:42:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3315 (T306560)', diff saved to https://phabricator.wikimedia.org/P27199 and previous config saved to /var/cache/conftool/dbconfig/20220502-004222-ladsgroup.json [00:42:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:42:26] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [00:44:36] !log ladsgroup@cumin1001 dbctl 
commit (dc=all): 'Repooling after maintenance db1096:3315 (T306560)', diff saved to https://phabricator.wikimedia.org/P27200 and previous config saved to /var/cache/conftool/dbconfig/20220502-004435-ladsgroup.json [00:44:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:45:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P27201 and previous config saved to /var/cache/conftool/dbconfig/20220502-004557-ladsgroup.json [00:45:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:51:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T298563)', diff saved to https://phabricator.wikimedia.org/P27202 and previous config saved to /var/cache/conftool/dbconfig/20220502-005118-ladsgroup.json [00:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:51:23] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [00:59:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P27203 and previous config saved to /var/cache/conftool/dbconfig/20220502-005940-ladsgroup.json [00:59:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:01:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P27204 and previous config saved to /var/cache/conftool/dbconfig/20220502-010102-ladsgroup.json [01:01:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:06:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P27205 and previous config saved to 
/var/cache/conftool/dbconfig/20220502-010623-ladsgroup.json [01:06:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:14:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P27206 and previous config saved to /var/cache/conftool/dbconfig/20220502-011445-ladsgroup.json [01:14:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:16:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T298565)', diff saved to https://phabricator.wikimedia.org/P27207 and previous config saved to /var/cache/conftool/dbconfig/20220502-011607-ladsgroup.json [01:16:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:16:12] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [01:21:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P27208 and previous config saved to /var/cache/conftool/dbconfig/20220502-012128-ladsgroup.json [01:21:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:29:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T306560)', diff saved to https://phabricator.wikimedia.org/P27209 and previous config saved to /var/cache/conftool/dbconfig/20220502-012950-ladsgroup.json [01:29:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:29:55] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [01:33:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on 
db1157.eqiad.wmnet with reason: Maintenance [01:33:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance [01:33:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:33:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:33:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1157 (T298565)', diff saved to https://phabricator.wikimedia.org/P27210 and previous config saved to /var/cache/conftool/dbconfig/20220502-013316-ladsgroup.json [01:33:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:33:20] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [01:36:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T298563)', diff saved to https://phabricator.wikimedia.org/P27211 and previous config saved to /var/cache/conftool/dbconfig/20220502-013633-ladsgroup.json [01:36:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1098.eqiad.wmnet with reason: Maintenance [01:36:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1098.eqiad.wmnet with reason: Maintenance [01:36:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:36:38] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [01:36:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:36:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3316 (T298563)', diff saved to https://phabricator.wikimedia.org/P27212 and previous 
config saved to /var/cache/conftool/dbconfig/20220502-013641-ladsgroup.json [01:36:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:36:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:39:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:49:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:50:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T298565)', diff saved to https://phabricator.wikimedia.org/P27213 and previous config saved to /var/cache/conftool/dbconfig/20220502-015028-ladsgroup.json [01:50:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:50:33] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [02:05:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P27214 and previous config saved to /var/cache/conftool/dbconfig/20220502-020533-ladsgroup.json [02:05:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:20:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P27215 and previous config saved to 
/var/cache/conftool/dbconfig/20220502-022038-ladsgroup.json [02:20:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:34:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T298563)', diff saved to https://phabricator.wikimedia.org/P27216 and previous config saved to /var/cache/conftool/dbconfig/20220502-023429-ladsgroup.json [02:34:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:34:34] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [02:35:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T298565)', diff saved to https://phabricator.wikimedia.org/P27217 and previous config saved to /var/cache/conftool/dbconfig/20220502-023543-ladsgroup.json [02:35:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance [02:35:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance [02:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:35:48] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [02:35:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [02:35:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:35:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [02:35:52] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:35:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:35:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:35:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T298565)', diff saved to https://phabricator.wikimedia.org/P27218 and previous config saved to /var/cache/conftool/dbconfig/20220502-023556-ladsgroup.json [02:36:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:36:03] PROBLEM - Hadoop NodeManager on an-worker1116 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [02:36:21] PROBLEM - Check systemd state on an-worker1116 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [02:38:41] RECOVERY - Check systemd state on an-worker1116 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:40:39] RECOVERY - Hadoop NodeManager on an-worker1116 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [02:49:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P27219 and previous config saved 
to /var/cache/conftool/dbconfig/20220502-024934-ladsgroup.json [02:49:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:59:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [02:59:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [02:59:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:59:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:59:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T306560)', diff saved to https://phabricator.wikimedia.org/P27220 and previous config saved to /var/cache/conftool/dbconfig/20220502-025930-ladsgroup.json [02:59:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:59:34] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [03:01:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T306560)', diff saved to https://phabricator.wikimedia.org/P27221 and previous config saved to /var/cache/conftool/dbconfig/20220502-030141-ladsgroup.json [03:01:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [03:04:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P27222 and previous config saved to /var/cache/conftool/dbconfig/20220502-030439-ladsgroup.json [03:04:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[03:09:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [03:09:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [03:09:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:09:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:09:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance [03:09:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance [03:09:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:09:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:10:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1163 (T306560)', diff saved to https://phabricator.wikimedia.org/P27223 and previous config saved to /var/cache/conftool/dbconfig/20220502-030958-ladsgroup.json [03:10:04] SRE, Infrastructure-Foundations, fundraising-tech-ops, netops: Upgrade pfw to Junos 20+ - https://phabricator.wikimedia.org/T295691 (Papaul) @Jgreen hello do you think this can be done on May the 16th? 
[03:10:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:10:07] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [03:11:22] SRE, Infrastructure-Foundations, fundraising-tech-ops, netops: Upgrade pfw to Junos 20+ - https://phabricator.wikimedia.org/T295691 (Papaul) p:Low→Medium [03:12:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T306560)', diff saved to https://phabricator.wikimedia.org/P27224 and previous config saved to /var/cache/conftool/dbconfig/20220502-031218-ladsgroup.json [03:12:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:15:49] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:16:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P27225 and previous config saved to /var/cache/conftool/dbconfig/20220502-031646-ladsgroup.json [03:16:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:19:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T298563)', diff saved to https://phabricator.wikimedia.org/P27226 and previous config saved to /var/cache/conftool/dbconfig/20220502-031944-ladsgroup.json [03:19:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1140.eqiad.wmnet with reason: Maintenance [03:19:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1140.eqiad.wmnet with reason: Maintenance [03:19:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:19:49] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [03:19:51] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:19:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:20:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [03:20:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [03:20:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:20:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:20:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3315 (T298295)', diff saved to https://phabricator.wikimedia.org/P27227 and previous config saved to /var/cache/conftool/dbconfig/20220502-032011-ladsgroup.json [03:20:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:20:16] T298295: Fix length of columns page_restrictions.pr_level/pr_type on wmf wikis - https://phabricator.wikimedia.org/T298295 [03:22:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T298295)', diff saved to https://phabricator.wikimedia.org/P27228 and previous config saved to /var/cache/conftool/dbconfig/20220502-032229-ladsgroup.json [03:22:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:25:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P27229 and previous config saved to /var/cache/conftool/dbconfig/20220502-032504-ladsgroup.json [03:25:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:27:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P27230 and previous config saved to /var/cache/conftool/dbconfig/20220502-032723-ladsgroup.json [03:27:25] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:31:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P27231 and previous config saved to /var/cache/conftool/dbconfig/20220502-033152-ladsgroup.json [03:31:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:37:00] (PS1) Ladsgroup: Set testwiki to READ NEW for templatelinks migration [mediawiki-config] - https://gerrit.wikimedia.org/r/787939 (https://phabricator.wikimedia.org/T306673) [03:37:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P27232 and previous config saved to /var/cache/conftool/dbconfig/20220502-033735-ladsgroup.json [03:37:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:39:03] PROBLEM - Check systemd state on an-worker1103 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:39:09] (CR) Ladsgroup: [C: +2] Set testwiki to READ NEW for templatelinks migration [mediawiki-config] - https://gerrit.wikimedia.org/r/787939 (https://phabricator.wikimedia.org/T306673) (owner: Ladsgroup) [03:39:39] PROBLEM - Hadoop NodeManager on an-worker1103 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [03:39:52] (Merged) jenkins-bot: Set testwiki to READ NEW for templatelinks migration [mediawiki-config] - https://gerrit.wikimedia.org/r/787939 (https://phabricator.wikimedia.org/T306673) (owner: Ladsgroup) [03:40:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to 
https://phabricator.wikimedia.org/P27233 and previous config saved to /var/cache/conftool/dbconfig/20220502-034009-ladsgroup.json [03:40:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:41:12] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:787939|Set testwiki to READ NEW for templatelinks migration (T306673)]] (duration: 00m 49s) [03:41:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:41:15] T306673: Turn on read new for templatelinks on beta and production - https://phabricator.wikimedia.org/T306673 [03:42:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P27234 and previous config saved to /var/cache/conftool/dbconfig/20220502-034228-ladsgroup.json [03:42:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:45:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [03:45:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:46:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [03:46:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [03:46:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:46:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:46:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T306560)', diff saved to https://phabricator.wikimedia.org/P27235 and previous config saved to /var/cache/conftool/dbconfig/20220502-034657-ladsgroup.json [03:47:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:47:01] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [03:47:43] !log mwdebug-deploy@deploy1002 
helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [03:47:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:50:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [03:52:15] RECOVERY - Check systemd state on an-worker1103 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:52:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P27236 and previous config saved to /var/cache/conftool/dbconfig/20220502-035240-ladsgroup.json [03:52:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:52:53] RECOVERY - Hadoop NodeManager on an-worker1103 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [03:55:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T298565)', diff saved to https://phabricator.wikimedia.org/P27237 and previous config saved to /var/cache/conftool/dbconfig/20220502-035514-ladsgroup.json [03:55:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [03:55:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [03:55:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:55:18] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, 
user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [03:55:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:55:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T298565)', diff saved to https://phabricator.wikimedia.org/P27238 and previous config saved to /var/cache/conftool/dbconfig/20220502-035522-ladsgroup.json [03:55:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:55:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:57:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T306560)', diff saved to https://phabricator.wikimedia.org/P27239 and previous config saved to /var/cache/conftool/dbconfig/20220502-035733-ladsgroup.json [03:57:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:57:37] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [03:57:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance [03:57:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance [03:57:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:57:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 14 hosts with reason: Maintenance [03:57:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:57:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:57:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [03:57:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 14 hosts with reason: 
Maintenance [03:57:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:57:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:58:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1132.eqiad.wmnet with reason: Maintenance [03:58:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1132.eqiad.wmnet with reason: Maintenance [03:58:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:58:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:58:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [03:58:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [03:58:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:58:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:58:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3311 (T306560)', diff saved to https://phabricator.wikimedia.org/P27240 and previous config saved to /var/cache/conftool/dbconfig/20220502-035830-ladsgroup.json [03:58:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:00:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T306560)', diff saved to https://phabricator.wikimedia.org/P27241 and previous config saved to /var/cache/conftool/dbconfig/20220502-040051-ladsgroup.json [04:00:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:01:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [04:01:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [04:01:36] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:01:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:05:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [04:05:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:07:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T298295)', diff saved to https://phabricator.wikimedia.org/P27242 and previous config saved to /var/cache/conftool/dbconfig/20220502-040745-ladsgroup.json [04:07:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1130.eqiad.wmnet with reason: Maintenance [04:07:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1130.eqiad.wmnet with reason: Maintenance [04:07:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:07:51] T298295: Fix length of columns page_restrictions.pr_level/pr_type on wmf wikis - https://phabricator.wikimedia.org/T298295 [04:07:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:07:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1130 (T298295)', diff saved to https://phabricator.wikimedia.org/P27243 and previous config saved to /var/cache/conftool/dbconfig/20220502-040754-ladsgroup.json [04:07:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:09:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T298295)', diff saved to https://phabricator.wikimedia.org/P27244 and previous config saved to /var/cache/conftool/dbconfig/20220502-040908-ladsgroup.json [04:09:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:09:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on 
db2129.codfw.wmnet with reason: Maintenance [04:10:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2129.codfw.wmnet with reason: Maintenance [04:10:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:10:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 20:00:00 on 8 hosts with reason: Maintenance [04:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:10:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on 8 hosts with reason: Maintenance [04:10:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:11:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T298565)', diff saved to https://phabricator.wikimedia.org/P27245 and previous config saved to /var/cache/conftool/dbconfig/20220502-041141-ladsgroup.json [04:11:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:11:46] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [04:15:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P27246 and previous config saved to /var/cache/conftool/dbconfig/20220502-041556-ladsgroup.json [04:15:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:26:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P27247 and previous config saved to 
/var/cache/conftool/dbconfig/20220502-042646-ladsgroup.json [04:26:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:31:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P27248 and previous config saved to /var/cache/conftool/dbconfig/20220502-043101-ladsgroup.json [04:31:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:41:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P27249 and previous config saved to /var/cache/conftool/dbconfig/20220502-044151-ladsgroup.json [04:41:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:46:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T306560)', diff saved to https://phabricator.wikimedia.org/P27250 and previous config saved to /var/cache/conftool/dbconfig/20220502-044606-ladsgroup.json [04:46:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance [04:46:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance [04:46:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:46:11] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [04:46:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:46:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:46:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1119 (T306560)', diff saved to https://phabricator.wikimedia.org/P27251 and previous config saved to /var/cache/conftool/dbconfig/20220502-044614-ladsgroup.json [04:46:18] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:48:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T306560)', diff saved to https://phabricator.wikimedia.org/P27252 and previous config saved to /var/cache/conftool/dbconfig/20220502-044834-ladsgroup.json [04:48:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:54:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [04:54:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [04:54:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:54:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:54:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance [04:54:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance [04:54:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:54:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [04:54:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:54:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:54:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [04:54:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:55:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [04:55:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with 
reason: Maintenance [04:55:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:55:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:55:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance [04:55:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance [04:55:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:55:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:55:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T298295)', diff saved to https://phabricator.wikimedia.org/P27253 and previous config saved to /var/cache/conftool/dbconfig/20220502-045532-ladsgroup.json [04:55:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:55:37] T298295: Fix length of columns page_restrictions.pr_level/pr_type on wmf wikis - https://phabricator.wikimedia.org/T298295 [04:56:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [04:56:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [04:56:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:56:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:56:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T298565)', diff saved to https://phabricator.wikimedia.org/P27254 and previous config saved to /var/cache/conftool/dbconfig/20220502-045656-ladsgroup.json [04:57:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:57:01] T298565: Fix mismatching field type of user table for columns user_email_authenticated, 
user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [04:57:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T298295)', diff saved to https://phabricator.wikimedia.org/P27255 and previous config saved to /var/cache/conftool/dbconfig/20220502-045750-ladsgroup.json [04:57:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:03:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P27256 and previous config saved to /var/cache/conftool/dbconfig/20220502-050339-ladsgroup.json [05:03:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:12:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P27257 and previous config saved to /var/cache/conftool/dbconfig/20220502-051255-ladsgroup.json [05:12:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:13:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [05:13:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [05:13:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:14:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:14:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T298565)', diff saved to https://phabricator.wikimedia.org/P27258 and previous config saved to /var/cache/conftool/dbconfig/20220502-051402-ladsgroup.json [05:14:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:14:06] T298565: Fix 
mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [05:18:05] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:18:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P27259 and previous config saved to /var/cache/conftool/dbconfig/20220502-051844-ladsgroup.json [05:18:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:20:31] killed bnwiki's refresh links recommendation (T299021) [05:20:31] T299021: Shorten running time of refreshLinkRecommendations.php - https://phabricator.wikimedia.org/T299021 [05:20:35] !log killed bnwiki's refresh links recommendation (T299021) [05:20:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:28:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P27260 and previous config saved to /var/cache/conftool/dbconfig/20220502-052800-ladsgroup.json [05:28:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:29:03] (PS3) KartikMistry: Enable SectionTranslation in testwiki for af, as, gu, kn, mk and sr [mediawiki-config] - https://gerrit.wikimedia.org/r/787868 (https://phabricator.wikimedia.org/T304828) [05:33:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T306560)', diff saved to https://phabricator.wikimedia.org/P27261 and previous config saved to /var/cache/conftool/dbconfig/20220502-053349-ladsgroup.json [05:33:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on 
db1135.eqiad.wmnet with reason: Maintenance [05:33:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance [05:33:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:33:54] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [05:33:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:33:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T306560)', diff saved to https://phabricator.wikimedia.org/P27262 and previous config saved to /var/cache/conftool/dbconfig/20220502-053357-ladsgroup.json [05:33:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:34:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:36:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T298565)', diff saved to https://phabricator.wikimedia.org/P27263 and previous config saved to /var/cache/conftool/dbconfig/20220502-053615-ladsgroup.json [05:36:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:36:20] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [05:40:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1096.eqiad.wmnet with reason: Maintenance [05:40:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1096.eqiad.wmnet with reason: Maintenance [05:40:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:40:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:40:40] 
!log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3316 (T298563)', diff saved to https://phabricator.wikimedia.org/P27264 and previous config saved to /var/cache/conftool/dbconfig/20220502-054040-ladsgroup.json [05:40:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:40:44] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [05:43:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T298295)', diff saved to https://phabricator.wikimedia.org/P27265 and previous config saved to /var/cache/conftool/dbconfig/20220502-054305-ladsgroup.json [05:43:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [05:43:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [05:43:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:43:10] T298295: Fix length of columns page_restrictions.pr_level/pr_type on wmf wikis - https://phabricator.wikimedia.org/T298295 [05:43:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:43:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T298295)', diff saved to https://phabricator.wikimedia.org/P27266 and previous config saved to /var/cache/conftool/dbconfig/20220502-054313-ladsgroup.json [05:43:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:43:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:43:40] (PS6) Ladsgroup: TimedMediaHandler: Make videojs the only player everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/612349 (https://phabricator.wikimedia.org/T248418) (owner: Jforrester) [05:43:51] (CR) jerkins-bot: [V: -1] TimedMediaHandler: Make videojs the only player 
everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/612349 (https://phabricator.wikimedia.org/T248418) (owner: Jforrester) [05:45:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T298295)', diff saved to https://phabricator.wikimedia.org/P27267 and previous config saved to /var/cache/conftool/dbconfig/20220502-054530-ladsgroup.json [05:45:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:46:48] (PS7) Ladsgroup: TimedMediaHandler: Make videojs the only player everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/612349 (https://phabricator.wikimedia.org/T248418) (owner: Jforrester) [05:51:17] (CR) Ladsgroup: [C: +2] TimedMediaHandler: Make videojs the only player everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/612349 (https://phabricator.wikimedia.org/T248418) (owner: Jforrester) [05:51:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P27268 and previous config saved to /var/cache/conftool/dbconfig/20220502-055121-ladsgroup.json [05:51:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P27269 and previous config saved to /var/cache/conftool/dbconfig/20220502-055121-ladsgroup.json [05:51:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:51:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:52:00] (Merged) jenkins-bot: TimedMediaHandler: Make videojs the only player everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/612349 (https://phabricator.wikimedia.org/T248418) (owner: Jforrester) [05:53:23] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:612349|TimedMediaHandler: Make videojs the only player everywhere (T248418)]] (duration: 00m 
47s) [05:53:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:53:27] T248418: Roll out videojs as the only video/audio player on all Wikimedia wikis - https://phabricator.wikimedia.org/T248418 [05:56:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [05:56:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:59:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [05:59:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [05:59:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:59:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P27270 and previous config saved to /var/cache/conftool/dbconfig/20220502-060035-ladsgroup.json [06:00:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:01:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [06:01:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:06:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P27271 and previous config saved to /var/cache/conftool/dbconfig/20220502-060626-ladsgroup.json [06:06:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P27272 and previous config saved to /var/cache/conftool/dbconfig/20220502-060626-ladsgroup.json [06:06:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:06:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:14:44] SRE: phedenskog uses the same 
SSH key(s) in WMCS and production - https://phabricator.wikimedia.org/T307079 (Peter) What's the preferred way to send the public key? [06:15:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P27273 and previous config saved to /var/cache/conftool/dbconfig/20220502-061540-ladsgroup.json [06:15:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:19:55] PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:21:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T298565)', diff saved to https://phabricator.wikimedia.org/P27274 and previous config saved to /var/cache/conftool/dbconfig/20220502-062131-ladsgroup.json [06:21:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T306560)', diff saved to https://phabricator.wikimedia.org/P27275 and previous config saved to /var/cache/conftool/dbconfig/20220502-062131-ladsgroup.json [06:21:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance [06:21:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:21:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance [06:21:36] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [06:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:21:39] T306560: Fix nullability of img_major_mime and oi_major_mime - 
https://phabricator.wikimedia.org/T306560 [06:21:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T306560)', diff saved to https://phabricator.wikimedia.org/P27276 and previous config saved to /var/cache/conftool/dbconfig/20220502-062139-ladsgroup.json [06:21:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:21:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:21:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:26:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T306560)', diff saved to https://phabricator.wikimedia.org/P27277 and previous config saved to /var/cache/conftool/dbconfig/20220502-062659-ladsgroup.json [06:27:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:27:03] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [06:27:47] Can anyone refresh/update Deployment calendar? 
[06:30:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T298295)', diff saved to https://phabricator.wikimedia.org/P27278 and previous config saved to /var/cache/conftool/dbconfig/20220502-063047-ladsgroup.json [06:30:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [06:30:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [06:30:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:30:51] T298295: Fix length of columns page_restrictions.pr_level/pr_type on wmf wikis - https://phabricator.wikimedia.org/T298295 [06:30:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:30:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T298295)', diff saved to https://phabricator.wikimedia.org/P27279 and previous config saved to /var/cache/conftool/dbconfig/20220502-063055-ladsgroup.json [06:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:30:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:32:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T298295)', diff saved to https://phabricator.wikimedia.org/P27280 and previous config saved to /var/cache/conftool/dbconfig/20220502-063212-ladsgroup.json [06:32:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:32:25] kart_: someone else in releng may be able to run the script, or you might have to wait for thcipriani [06:37:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T298563)', diff saved to https://phabricator.wikimedia.org/P27281 and previous config saved to /var/cache/conftool/dbconfig/20220502-063740-ladsgroup.json [06:37:43] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:37:45] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [06:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:38:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [06:38:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [06:38:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:38:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:38:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T298565)', diff saved to https://phabricator.wikimedia.org/P27282 and previous config saved to /var/cache/conftool/dbconfig/20220502-063837-ladsgroup.json [06:38:40] (PS1) Elukey: Upgrade ores2001's celery settings [puppet] - https://gerrit.wikimedia.org/r/788276 (https://phabricator.wikimedia.org/T303801) [06:38:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:38:41] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [06:41:49] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35017/console" [puppet] - https://gerrit.wikimedia.org/r/788276 (https://phabricator.wikimedia.org/T303801) (owner: Elukey) [06:42:04] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P27283 and previous config saved to /var/cache/conftool/dbconfig/20220502-064204-ladsgroup.json [06:42:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:43:59] p858snake: Yes. Pinged on #wikimedia-releng. We've deployment window in 15 minutes though :/ [06:47:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P27284 and previous config saved to /var/cache/conftool/dbconfig/20220502-064717-ladsgroup.json [06:47:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:48:31] jouncebot: next [06:48:31] No deployments scheduled for the forseeable future! [06:48:36] :/ [06:51:50] (PS2) Elukey: Upgrade ores2001's celery settings [puppet] - https://gerrit.wikimedia.org/r/788276 (https://phabricator.wikimedia.org/T303801) [06:51:52] (PS1) Elukey: celery: fix version comparison in systemd template [puppet] - https://gerrit.wikimedia.org/r/788277 [06:52:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P27285 and previous config saved to /var/cache/conftool/dbconfig/20220502-065245-ladsgroup.json [06:52:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:54:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T298565)', diff saved to https://phabricator.wikimedia.org/P27286 and previous config saved to /var/cache/conftool/dbconfig/20220502-065442-ladsgroup.json [06:54:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:54:46] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, 
user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [06:56:12] (PS2) Elukey: celery: fix version comparison in systemd template [puppet] - https://gerrit.wikimedia.org/r/788277 [06:56:14] (PS3) Elukey: Upgrade ores2001's celery settings [puppet] - https://gerrit.wikimedia.org/r/788276 (https://phabricator.wikimedia.org/T303801) [06:56:46] (CR) jerkins-bot: [V: -1] celery: fix version comparison in systemd template [puppet] - https://gerrit.wikimedia.org/r/788277 (owner: Elukey) [06:57:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P27287 and previous config saved to /var/cache/conftool/dbconfig/20220502-065709-ladsgroup.json [06:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [07:02:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P27288 and previous config saved to /var/cache/conftool/dbconfig/20220502-070222-ladsgroup.json [07:02:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:03:25] (PS3) Elukey: celery: fix version comparison in systemd template [puppet] - https://gerrit.wikimedia.org/r/788277 [07:03:27] (PS4) Elukey: Upgrade ores2001's celery settings [puppet] - https://gerrit.wikimedia.org/r/788276 (https://phabricator.wikimedia.org/T303801) [07:04:33] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35020/console" [puppet] - 
https://gerrit.wikimedia.org/r/788276 (https://phabricator.wikimedia.org/T303801) (owner: Elukey) [07:07:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P27289 and previous config saved to /var/cache/conftool/dbconfig/20220502-070750-ladsgroup.json [07:07:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:08:25] (PS4) Elukey: celery: fix version comparison in systemd template [puppet] - https://gerrit.wikimedia.org/r/788277 [07:08:27] (PS5) Elukey: Upgrade ores2001's celery settings [puppet] - https://gerrit.wikimedia.org/r/788276 (https://phabricator.wikimedia.org/T303801) [07:09:16] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35021/console" [puppet] - https://gerrit.wikimedia.org/r/788276 (https://phabricator.wikimedia.org/T303801) (owner: Elukey) [07:09:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P27290 and previous config saved to /var/cache/conftool/dbconfig/20220502-070947-ladsgroup.json [07:09:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T306560)', diff saved to https://phabricator.wikimedia.org/P27291 and previous config saved to /var/cache/conftool/dbconfig/20220502-071214-ladsgroup.json [07:12:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance [07:12:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance [07:12:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:18] T306560: Fix nullability of img_major_mime and
oi_major_mime - https://phabricator.wikimedia.org/T306560 [07:12:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T306560)', diff saved to https://phabricator.wikimedia.org/P27292 and previous config saved to /var/cache/conftool/dbconfig/20220502-071222-ladsgroup.json [07:12:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:43] (PS6) Elukey: Upgrade ores2001's celery settings [puppet] - https://gerrit.wikimedia.org/r/788276 (https://phabricator.wikimedia.org/T303801) [07:13:44] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35022/console" [puppet] - https://gerrit.wikimedia.org/r/788276 (https://phabricator.wikimedia.org/T303801) (owner: Elukey) [07:14:33] (PS7) Elukey: Upgrade ores2001's celery settings [puppet] - https://gerrit.wikimedia.org/r/788276 (https://phabricator.wikimedia.org/T303801) [07:14:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T306560)', diff saved to https://phabricator.wikimedia.org/P27293 and previous config saved to /var/cache/conftool/dbconfig/20220502-071442-ladsgroup.json [07:14:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:15:32] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35023/console" [puppet] - https://gerrit.wikimedia.org/r/788276 (https://phabricator.wikimedia.org/T303801) (owner: Elukey) [07:17:14] (PS1) Ladsgroup: wikireplicas: Add linktarget to maintain-views [puppet] - https://gerrit.wikimedia.org/r/788278 (https://phabricator.wikimedia.org/T305064) [07:17:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance
db1113:3315 (T298295)', diff saved to https://phabricator.wikimedia.org/P27294 and previous config saved to /var/cache/conftool/dbconfig/20220502-071728-ladsgroup.json [07:17:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [07:17:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [07:17:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:33] T298295: Fix length of columns page_restrictions.pr_level/pr_type on wmf wikis - https://phabricator.wikimedia.org/T298295 [07:17:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [07:17:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [07:17:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T298295)', diff saved to https://phabricator.wikimedia.org/P27295 and previous config saved to /var/cache/conftool/dbconfig/20220502-071741-ladsgroup.json [07:17:41] (CR) Elukey: [V: +1 C: +2] Upgrade ores2001's celery settings [puppet] - https://gerrit.wikimedia.org/r/788276 (https://phabricator.wikimedia.org/T303801) (owner: Elukey) [07:17:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:08] (PS2) Ladsgroup: wikireplicas: Add linktarget to maintain-views [puppet] -
https://gerrit.wikimedia.org/r/788278 (https://phabricator.wikimedia.org/T305064) [07:19:29] (CR) Ladsgroup: "Ping" [puppet] - https://gerrit.wikimedia.org/r/783845 (https://phabricator.wikimedia.org/T275246) (owner: Ladsgroup) [07:19:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T298295)', diff saved to https://phabricator.wikimedia.org/P27296 and previous config saved to /var/cache/conftool/dbconfig/20220502-071958-ladsgroup.json [07:20:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:22:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T298563)', diff saved to https://phabricator.wikimedia.org/P27297 and previous config saved to /var/cache/conftool/dbconfig/20220502-072255-ladsgroup.json [07:22:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1131.eqiad.wmnet with reason: Maintenance [07:22:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1131.eqiad.wmnet with reason: Maintenance [07:22:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:00] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [07:23:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1131 (T298563)', diff saved to https://phabricator.wikimedia.org/P27298 and previous config saved to /var/cache/conftool/dbconfig/20220502-072303-ladsgroup.json [07:23:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:23] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ores2001.codfw.wmnet with OS buster [07:23:26] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:24:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P27299 and previous config saved to /var/cache/conftool/dbconfig/20220502-072452-ladsgroup.json [07:24:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P27300 and previous config saved to /var/cache/conftool/dbconfig/20220502-072947-ladsgroup.json [07:29:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:32:22] SRE: DNS repo: add Jenkins job to ensure there are no duplicates - https://phabricator.wikimedia.org/T155761 (fgiunchedi) Thank you for the clarification @Volans, we're definitely in a better place nowadays so IMHO this task is done according to the original description. What you mentioned sounds definitely... [07:35:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P27301 and previous config saved to /var/cache/conftool/dbconfig/20220502-073503-ladsgroup.json [07:35:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:09] SRE: phedenskog uses the same SSH key(s) in WMCS and production - https://phabricator.wikimedia.org/T307079 (fgiunchedi) >>! In T307079#7894859, @Peter wrote: > What's the preferred way to send the public key? For production you can send a review similar to https://gerrit.wikimedia.org/r/c/operations/puppet...
[07:39:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T298565)', diff saved to https://phabricator.wikimedia.org/P27302 and previous config saved to /var/cache/conftool/dbconfig/20220502-073958-ladsgroup.json [07:40:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [07:40:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [07:40:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:03] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [07:40:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T298565)', diff saved to https://phabricator.wikimedia.org/P27303 and previous config saved to /var/cache/conftool/dbconfig/20220502-074006-ladsgroup.json [07:40:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P27304 and previous config saved to /var/cache/conftool/dbconfig/20220502-074452-ladsgroup.json [07:44:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:41] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ores2001.codfw.wmnet with reason: host reimage [07:48:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:27] !log ladsgroup@cumin1001 
dbctl commit (dc=all): 'Repooling after maintenance db1131 (T298563)', diff saved to https://phabricator.wikimedia.org/P27305 and previous config saved to /var/cache/conftool/dbconfig/20220502-074927-ladsgroup.json [07:49:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:31] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [07:50:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P27306 and previous config saved to /var/cache/conftool/dbconfig/20220502-075008-ladsgroup.json [07:50:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:27] (PS6) Sergio Gimeno: Newcomer tasks: deploy AND topic selection to pilot wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/780874 (https://phabricator.wikimedia.org/T305399) [07:50:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [07:54:12] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ores2001.codfw.wmnet with reason: host reimage [07:54:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T298565)', diff saved to https://phabricator.wikimedia.org/P27307 and previous config saved to /var/cache/conftool/dbconfig/20220502-075644-ladsgroup.json [07:56:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:49] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time,
user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [07:59:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T306560)', diff saved to https://phabricator.wikimedia.org/P27308 and previous config saved to /var/cache/conftool/dbconfig/20220502-075957-ladsgroup.json [08:00:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance [08:00:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance [08:00:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [08:00:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:05] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [08:00:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [08:00:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T306560)', diff saved to https://phabricator.wikimedia.org/P27309 and previous config saved to /var/cache/conftool/dbconfig/20220502-080012-ladsgroup.json [08:00:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:04] (PS1) DCausse: cirrus: Enable DeprecationLoggedHttps in deployment-prep
[mediawiki-config] - https://gerrit.wikimedia.org/r/788282 (https://phabricator.wikimedia.org/T218994) [08:02:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T306560)', diff saved to https://phabricator.wikimedia.org/P27310 and previous config saved to /var/cache/conftool/dbconfig/20220502-080232-ladsgroup.json [08:02:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P27311 and previous config saved to /var/cache/conftool/dbconfig/20220502-080432-ladsgroup.json [08:04:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T298295)', diff saved to https://phabricator.wikimedia.org/P27312 and previous config saved to /var/cache/conftool/dbconfig/20220502-080513-ladsgroup.json [08:05:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:17] T298295: Fix length of columns page_restrictions.pr_level/pr_type on wmf wikis - https://phabricator.wikimedia.org/T298295 [08:11:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P27313 and previous config saved to /var/cache/conftool/dbconfig/20220502-081149-ladsgroup.json [08:11:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:26] !log jynus@cumin1001 START - Cookbook sre.hosts.reimage for host backup1002.eqiad.wmnet with OS buster [08:16:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P27314 and previous config saved to
/var/cache/conftool/dbconfig/20220502-081737-ladsgroup.json [08:17:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P27315 and previous config saved to /var/cache/conftool/dbconfig/20220502-081937-ladsgroup.json [08:19:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:01] RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:22:22] !log elukey@deploy1002 Started deploy [ores/deploy@98a1b2e]: (no justification provided) [08:22:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:29] !log elukey@deploy1002 Finished deploy [ores/deploy@98a1b2e]: (no justification provided) (duration: 02m 06s) [08:24:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:52] (CR) David Caro: [C: +2] P:toolforge::prometheus: simplify prometheus config [puppet] - https://gerrit.wikimedia.org/r/779474 (https://phabricator.wikimedia.org/T304716) (owner: Majavah) [08:26:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P27316 and previous config saved to /var/cache/conftool/dbconfig/20220502-082654-ladsgroup.json [08:26:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:50] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ores2001.codfw.wmnet with OS buster [08:28:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P27317 and previous config saved to
/var/cache/conftool/dbconfig/20220502-083242-ladsgroup.json [08:32:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T298563)', diff saved to https://phabricator.wikimedia.org/P27318 and previous config saved to /var/cache/conftool/dbconfig/20220502-083442-ladsgroup.json [08:34:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1165.eqiad.wmnet with reason: Maintenance [08:34:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1165.eqiad.wmnet with reason: Maintenance [08:34:47] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [08:34:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 20:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [08:34:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [08:34:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T298563)', diff saved to https://phabricator.wikimedia.org/P27319 and previous config saved to /var/cache/conftool/dbconfig/20220502-083456-ladsgroup.json [08:34:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:18] (PS1) Kormat: change_localuser.lu_attached_timestamp_T302659.py: New
schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/788288 (https://phabricator.wikimedia.org/T302659) [08:42:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T298565)', diff saved to https://phabricator.wikimedia.org/P27320 and previous config saved to /var/cache/conftool/dbconfig/20220502-084200-ladsgroup.json [08:42:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:05] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [08:42:40] (CR) Hashar: gerrit: keep computing changes mergeability (1 comment) [puppet] - https://gerrit.wikimedia.org/r/786984 (https://phabricator.wikimedia.org/T303970) (owner: Hashar) [08:45:07] (PS1) Kormat: cu_changes_actor_comment_cols_T303603.py: New schema change. [software/schema-changes] - https://gerrit.wikimedia.org/r/788289 (https://phabricator.wikimedia.org/T303603) [08:45:26] (CR) jerkins-bot: [V: -1] cu_changes_actor_comment_cols_T303603.py: New schema change. [software/schema-changes] - https://gerrit.wikimedia.org/r/788289 (https://phabricator.wikimedia.org/T303603) (owner: Kormat) [08:45:27] !log jynus@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup1002.eqiad.wmnet with OS buster [08:45:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:05] !log vgutierrez@apt1001:~$ sudo -i reprepro --component thirdparty/haproxy24 update buster-wikimedia [08:46:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:37] (PS2) Kormat: cu_changes_actor_comment_cols_T303603.py: New schema change.
[software/schema-changes] - https://gerrit.wikimedia.org/r/788289 (https://phabricator.wikimedia.org/T303603) [08:47:15] !log test HAProxy 2.4.16 on cp4034 and cp4036 [08:47:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T306560)', diff saved to https://phabricator.wikimedia.org/P27321 and previous config saved to /var/cache/conftool/dbconfig/20220502-084747-ladsgroup.json [08:47:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance [08:47:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance [08:47:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:51] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [08:47:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [08:48:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [08:48:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3311 (T306560)', diff saved to https://phabricator.wikimedia.org/P27322 and previous config saved to /var/cache/conftool/dbconfig/20220502-084812-ladsgroup.json [08:48:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:32] !log ladsgroup@cumin1001 dbctl commit (dc=all):
'Repooling after maintenance db1105:3311 (T306560)', diff saved to https://phabricator.wikimedia.org/P27323 and previous config saved to /var/cache/conftool/dbconfig/20220502-085032-ladsgroup.json [08:50:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:43] (PS1) Kormat: fix_revision.rev_timestamp_type_T298560.py: New schema change. [software/schema-changes] - https://gerrit.wikimedia.org/r/788290 (https://phabricator.wikimedia.org/T298560) [08:54:34] RECOVERY - HTTPS-toolserver on www.toolserver.org is OK: SSL OK - Certificate toolserver.org valid until 2022-07-31 07:52:52 +0000 (expires in 89 days) https://phabricator.wikimedia.org/tag/toolforge/ [08:55:38] (PS1) Kormat: fix_logging.log_timestamp_type_T298555.py: New schema change. [software/schema-changes] - https://gerrit.wikimedia.org/r/788291 (https://phabricator.wikimedia.org/T298555) [08:57:14] !log jynus@cumin1001 START - Cookbook sre.hosts.reimage for host backup1002.eqiad.wmnet with OS buster [08:57:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:20] (CR) Ladsgroup: [C: +1] change_localuser.lu_attached_timestamp_T302659.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/788288 (https://phabricator.wikimedia.org/T302659) (owner: Kormat) [08:58:11] (CR) Kormat: [C: +2] change_localuser.lu_attached_timestamp_T302659.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/788288 (https://phabricator.wikimedia.org/T302659) (owner: Kormat) [08:59:10] (Merged) jenkins-bot: change_localuser.lu_attached_timestamp_T302659.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/788288 (https://phabricator.wikimedia.org/T302659) (owner: Kormat) [09:00:19] (CR) Ladsgroup: [C: +1] cu_changes_actor_comment_cols_T303603.py: New schema change.
[software/schema-changes] - https://gerrit.wikimedia.org/r/788289 (https://phabricator.wikimedia.org/T303603) (owner: Kormat) [09:00:31] !log jynus@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host backup1002.eqiad.wmnet with OS buster [09:00:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:51] !log jynus@cumin1001 START - Cookbook sre.hosts.reimage for host backup1002.eqiad.wmnet with OS bullseye [09:00:53] (CR) Kormat: [C: +2] cu_changes_actor_comment_cols_T303603.py: New schema change. [software/schema-changes] - https://gerrit.wikimedia.org/r/788289 (https://phabricator.wikimedia.org/T303603) (owner: Kormat) [09:00:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:58] !log jynus@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup1002.eqiad.wmnet with OS bullseye [09:01:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:14] (Merged) jenkins-bot: cu_changes_actor_comment_cols_T303603.py: New schema change.
[software/schema-changes] - https://gerrit.wikimedia.org/r/788289 (https://phabricator.wikimedia.org/T303603) (owner: Kormat) [09:01:36] !log jynus@cumin1001 START - Cookbook sre.hosts.reimage for host backup1002.eqiad.wmnet with OS bullseye [09:01:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:43] !log jynus@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup1002.eqiad.wmnet with OS bullseye [09:01:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:34] PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:03:54] !log jynus@cumin1001 START - Cookbook sre.hosts.reimage for host backup1002.eqiad.wmnet with OS bullseye [09:03:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:01] !log jynus@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup1002.eqiad.wmnet with OS bullseye [09:04:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T298563)', diff saved to https://phabricator.wikimedia.org/P27324 and previous config saved to /var/cache/conftool/dbconfig/20220502-090415-ladsgroup.json [09:04:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:20] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [09:05:24] (CR) Ladsgroup: [C: -1] fix_logging.log_timestamp_type_T298555.py: New schema change.
(1 comment) [software/schema-changes] - https://gerrit.wikimedia.org/r/788291 (https://phabricator.wikimedia.org/T298555) (owner: Kormat) [09:05:35] !log jynus@cumin1001 START - Cookbook sre.hosts.reimage for host backup1002.eqiad.wmnet with OS bullseye [09:05:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P27325 and previous config saved to /var/cache/conftool/dbconfig/20220502-090537-ladsgroup.json [09:05:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:42] !log jynus@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup1002.eqiad.wmnet with OS bullseye [09:05:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:30] !log jynus@cumin1001 START - Cookbook sre.hosts.reimage for host backup1002.eqiad.wmnet with OS bullseye [09:06:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:59] (PS2) Kormat: fix_logging.log_timestamp_type_T298555.py: New schema change. [software/schema-changes] - https://gerrit.wikimedia.org/r/788291 (https://phabricator.wikimedia.org/T298555) [09:07:01] (PS5) Jgiannelos: Deprecate unused maps event stream [mediawiki-config] - https://gerrit.wikimedia.org/r/747196 (https://phabricator.wikimedia.org/T293366) [09:07:19] (CR) Kormat: fix_logging.log_timestamp_type_T298555.py: New schema change. (1 comment) [software/schema-changes] - https://gerrit.wikimedia.org/r/788291 (https://phabricator.wikimedia.org/T298555) (owner: Kormat) [09:11:47] (CR) Ladsgroup: [C: -1] fix_logging.log_timestamp_type_T298555.py: New schema change.
(2 comments) [software/schema-changes] - https://gerrit.wikimedia.org/r/788291 (https://phabricator.wikimedia.org/T298555) (owner: Kormat) [09:14:14] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:19:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P27326 and previous config saved to /var/cache/conftool/dbconfig/20220502-091920-ladsgroup.json [09:19:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:59] !log installing ghostscript security updates on Stretch (newer distros not affected) [09:20:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P27327 and previous config saved to /var/cache/conftool/dbconfig/20220502-092042-ladsgroup.json [09:20:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:56] (CR) Ayounsi: WIP move core routers definitions to hiera (1 comment) [puppet] - https://gerrit.wikimedia.org/r/777347 (https://phabricator.wikimedia.org/T169860) (owner: Filippo Giunchedi) [09:33:41] SRE, Continuous-Integration-Infrastructure, Release-Engineering-Team (Seen): Upgrade ci ssh key to ecdsa - https://phabricator.wikimedia.org/T177826 (hashar) [09:34:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P27328 and previous config saved to /var/cache/conftool/dbconfig/20220502-093425-ladsgroup.json [09:34:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T306560)', diff
saved to https://phabricator.wikimedia.org/P27329 and previous config saved to /var/cache/conftool/dbconfig/20220502-093547-ladsgroup.json [09:35:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [09:35:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [09:35:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:52] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [09:35:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [09:36:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [09:36:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:18] (03PS1) 10Elukey: Upgrade ores2002's celery settings [puppet] - 10https://gerrit.wikimedia.org/r/788293 (https://phabricator.wikimedia.org/T303801) [09:36:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance [09:36:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance [09:36:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T306560)', diff saved to 
https://phabricator.wikimedia.org/P27330 and previous config saved to /var/cache/conftool/dbconfig/20220502-093628-ladsgroup.json [09:36:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:22] (03CR) 10Klausman: [C: 03+1] Upgrade ores2002's celery settings [puppet] - 10https://gerrit.wikimedia.org/r/788293 (https://phabricator.wikimedia.org/T303801) (owner: 10Elukey) [09:38:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T306560)', diff saved to https://phabricator.wikimedia.org/P27331 and previous config saved to /var/cache/conftool/dbconfig/20220502-093847-ladsgroup.json [09:38:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:35] (03PS1) 10David Caro: acme_chief: add log_level to the config [software/acme-chief] - 10https://gerrit.wikimedia.org/r/788294 (https://phabricator.wikimedia.org/T307333) [09:42:40] ^ what's the context for that CR? [09:42:43] dcaro ^^ [09:43:33] vgutierrez: hey, the task attached to it, we got another instance of acme-chief not reloading the certs, even with the watchdog "fix" [09:43:45] are you sure about that? [09:43:52] I don't see that on the task itself :) [09:43:53] that it happened yes [09:44:21] (03CR) 10JMeybohm: [C: 03+2] Remove k8s-ingress-wikikube.discovery.wmnet [dns] - 10https://gerrit.wikimedia.org/r/787756 (https://phabricator.wikimedia.org/T305358) (owner: 10JMeybohm) [09:45:13] dcaro: could you provide access to the instance or full acme-chief logs? 
[09:45:17] (CR) jerkins-bot: [V: -1] acme_chief: add log_level to the config [software/acme-chief] - https://gerrit.wikimedia.org/r/788294 (https://phabricator.wikimedia.org/T307333) (owner: David Caro) [09:45:35] vgutierrez: yes [09:45:44] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35024/console" [puppet] - https://gerrit.wikimedia.org/r/788293 (https://phabricator.wikimedia.org/T303801) (owner: Elukey) [09:45:58] (CR) JMeybohm: [C: +2] Remove k8s-ingress-wikikube.discovery.wmnet [puppet] - https://gerrit.wikimedia.org/r/787757 (https://phabricator.wikimedia.org/T305358) (owner: JMeybohm) [09:46:44] (CR) Klausman: [C: +2] Upgrade ores2002's celery settings [puppet] - https://gerrit.wikimedia.org/r/788293 (https://phabricator.wikimedia.org/T303801) (owner: Elukey) [09:47:01] vgutierrez: you should have access now [09:47:08] FQDN? :) [09:47:13] klausman: ok to merge? [09:47:21] yes [09:47:25] ack [09:47:41] jayme: now you own part of ORES [09:47:47] Mwhahahaha! [09:47:55] [lightning and thunder in the background] [09:47:58] ouch [09:48:16] I can revert!
:D [09:48:21] too late :D [09:48:25] vgutierrez: tools-acme-chief-01.tools.eqiad.wmflabs [09:49:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T298563)', diff saved to https://phabricator.wikimedia.org/P27332 and previous config saved to /var/cache/conftool/dbconfig/20220502-094930-ladsgroup.json [09:49:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1180.eqiad.wmnet with reason: Maintenance [09:49:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1180.eqiad.wmnet with reason: Maintenance [09:49:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:36] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [09:49:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T298563)', diff saved to https://phabricator.wikimedia.org/P27333 and previous config saved to /var/cache/conftool/dbconfig/20220502-094938-ladsgroup.json [09:49:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:11] dcaro: maybe I'm one puppet run away of getting access to that instance? [09:50:17] dcaro: it looks like the instance itself is refusing my public key [09:50:29] !log klausman@cumin2002 START - Cookbook sre.hosts.reimage for host ores2002.codfw.wmnet with OS buster [09:50:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:40] should not need a puppet run (using ldap) [09:51:10] vgutierrez: reran puppet, you can try again (nothing changed though) [09:51:40] so... I can't access that instance for some reason [09:51:52] can you access any other cloud VM? [09:52:56] yes.. 
I just logged to traffic-cache-atstext-buster.traffic.eqiad1.wikimedia.cloud [09:53:23] same user/key? (what is your user?) [09:53:28] yes [09:53:30] vgutierrez [09:53:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P27334 and previous config saved to /var/cache/conftool/dbconfig/20220502-095352-ladsgroup.json [09:53:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:31] 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10ayounsi) That's great! FYI, thanks to John rack and Ganeti cluster info from Netbox is available in Puppet, which could help aut... [09:55:29] jouncebot: now [09:55:29] No deployments scheduled for the next 3 hour(s) and 4 minute(s) [09:56:03] vgutierrez: https://phabricator.wikimedia.org/P27335 created a paste with the journal [09:57:39] going to ship a mw-config patch (labs only: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/788282) [09:58:10] dcaro: restarting acme-chief triggered a renewal... but that's weird, on May 1st acme-chief showed some activity [09:58:12] vgutierrez: the logs from may 02 are after the restart (when it started working) [09:58:13] (03CR) 10DCausse: [C: 03+2] cirrus: Enable DeprecationLoggedHttps in deployment-prep [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788282 (https://phabricator.wikimedia.org/T218994) (owner: 10DCausse) [09:58:51] 10SRE-tools, 10DNS, 10Infrastructure-Foundations: DNS repo: add Jenkins job to ensure there are no duplicates - https://phabricator.wikimedia.org/T155761 (10Volans) Ack, let me repurpose this one. 
[09:58:53] (03Merged) 10jenkins-bot: cirrus: Enable DeprecationLoggedHttps in deployment-prep [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788282 (https://phabricator.wikimedia.org/T218994) (owner: 10DCausse) [09:58:54] dcaro: are we sure that that instance had applied the watchdog update? [09:59:00] yes [09:59:01] I've seen the watchdog in action in our instances [09:59:28] root@tools-acme-chief-01:~# grep -i watchdog /etc/systemd/system/acme-chief.service.d/puppet-override.conf [09:59:30] WatchdogSec=600 [09:59:53] root@tools-acme-chief-01:~# apt policy acme-chief | grep -i installed [09:59:57] Installed: 0.34-1 [10:00:11] dcaro: at least what's missing on that journal output is one config reload attempt per hour [10:00:31] (03PS1) 10Filippo Giunchedi: clinic-duty: add Orange support [software] - 10https://gerrit.wikimedia.org/r/788296 [10:00:33] (03PS1) 10Filippo Giunchedi: clinic-duty: stop using 'document' to make tests pass [software] - 10https://gerrit.wikimedia.org/r/788297 [10:00:47] 10SRE-tools, 10DNS, 10Infrastructure-Foundations: DNS repo: add Jenkins job to ensure there are no duplicates - https://phabricator.wikimedia.org/T155761 (10Volans) [10:00:52] dcaro: without those scheduled config reloads acme-chief isn't able to renew the certificates [10:01:09] dcaro: and that would explain the behavior that you're seeing [10:01:20] (03CR) 10Filippo Giunchedi: "Timo, I'm not sure the way I chose to be able to set textCache makes sense here, what do you think?" [software] - 10https://gerrit.wikimedia.org/r/788296 (owner: 10Filippo Giunchedi) [10:01:54] (03CR) 10Filippo Giunchedi: "Timo, re: your change (thanks!) Ia27b2be79 which I merged yesterday" [software] - 10https://gerrit.wikimedia.org/r/788297 (owner: 10Filippo Giunchedi) [10:02:07] vgutierrez: yep, but why did that not happen? [10:03:17] dcaro: so.. 
in the production environment that's handled by a systemd timer called reload-acme-chief-backend [10:03:42] RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:03:42] Active: inactive (dead) since Mon 2021-11-22 23:57:48 UTC; 5 months 8 days ago [10:03:53] vgutierrez: it's there, but has not run in a long time [10:04:03] dcaro: that's what you need to fix [10:04:10] the timer active [10:04:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [10:04:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:09] vgutierrez@acmechief1001:~$ systemctl list-timers reload-acme-chief-backend.timer [10:05:09] NEXT LEFT LAST PASSED UNIT ACTIVATES [10:05:09] Mon 2022-05-02 10:30:27 UTC 25min left Mon 2022-05-02 09:30:27 UTC 34min ago reload-acme-chief-backend.timer reload-acme-chief-backend.service [10:05:28] it has no next [10:05:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [10:05:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [10:05:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:37] dcaro: uh? :) [10:05:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:55] vgutierrez: n/a n/a Mon 2021-11-22 23:57:47 UTC 5 months 8 days ago reload-acme-chief-backend.timer reload-acme-chief-backend.service [10:06:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [10:06:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:46] dcaro: hmm that same day 5366563476ce85e88c4334423e339e8012cc380c was merged [10:08:50] dcaro: which debian/systemd version is running on that instance? 
[10:08:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P27337 and previous config saved to /var/cache/conftool/dbconfig/20220502-100857-ladsgroup.json [10:08:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:30] vgutierrez: buster (10.12), with systemd 241 [10:09:49] wait buster or stretch? ... anyhow 10.12 [10:10:05] buster [10:10:10] buster yep [10:10:17] (/me gets confused with the numbers/names) [10:10:30] thanks :) [10:12:45] (03CR) 10Filippo Giunchedi: "Thank you Majavah for this! See inline" [puppet] - 10https://gerrit.wikimedia.org/r/787911 (owner: 10Majavah) [10:14:56] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:15:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T298563)', diff saved to https://phabricator.wikimedia.org/P27338 and previous config saved to /var/cache/conftool/dbconfig/20220502-101518-ladsgroup.json [10:15:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:23] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [10:15:51] !log klausman@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ores2002.codfw.wmnet with reason: host reimage [10:15:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:44] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ores2002.codfw.wmnet with reason: host reimage [10:18:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:02] !log klausman@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ores2002.codfw.wmnet with OS buster [10:19:05] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:05] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:23:50] (PS1) Majavah: ssl: Add dummy key for toolsbeta k8s prometheus [labs/private] - https://gerrit.wikimedia.org/r/788303 (https://phabricator.wikimedia.org/T304716) [10:24:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T306560)', diff saved to https://phabricator.wikimedia.org/P27340 and previous config saved to /var/cache/conftool/dbconfig/20220502-102402-ladsgroup.json [10:24:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:09] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [10:28:40] (PS1) Majavah: P:toolforge::prometheus: remove absented resources [puppet] - https://gerrit.wikimedia.org/r/788304 [10:28:42] (PS1) Majavah: P:toolforge::prometheus: add toolsbeta support [puppet] - https://gerrit.wikimedia.org/r/788305 (https://phabricator.wikimedia.org/T304716) [10:30:02] (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35025/console" [puppet] - https://gerrit.wikimedia.org/r/788305 (https://phabricator.wikimedia.org/T304716) (owner: Majavah) [10:30:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P27341 and previous config saved to /var/cache/conftool/dbconfig/20220502-103023-ladsgroup.json [10:30:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:07] (CR) jerkins-bot: [V: -1] P:toolforge::prometheus: add toolsbeta support [puppet] - https://gerrit.wikimedia.org/r/788305 (https://phabricator.wikimedia.org/T304716) (owner: Majavah) [10:32:21] (PS1) DCausse:
cirrus: fix transport to use https instead of http (labs only) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788307 [10:33:04] I need to ship this small followup ^ [10:33:58] (03CR) 10DCausse: [C: 03+2] cirrus: fix transport to use https instead of http (labs only) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788307 (owner: 10DCausse) [10:34:39] (03Merged) 10jenkins-bot: cirrus: fix transport to use https instead of http (labs only) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788307 (owner: 10DCausse) [10:34:58] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet [10:35:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:05] (03PS1) 10Elukey: sre.hosts.reimage: check if the first puppet output is not None [cookbooks] - 10https://gerrit.wikimedia.org/r/788308 [10:37:41] RECOVERY - Check systemd state on sretest1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:38:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet [10:38:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [10:41:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:22] !log jynus@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup1002.eqiad.wmnet with OS bullseye [10:42:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:49] !log jmm@cumin2002 START - 
Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet [10:43:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:56] !log jynus@cumin1001 START - Cookbook sre.hosts.reimage for host backup1002.eqiad.wmnet with OS bullseye [10:43:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [10:44:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [10:44:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P27342 and previous config saved to /var/cache/conftool/dbconfig/20220502-104528-ladsgroup.json [10:45:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [10:45:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:22] (03Abandoned) 10David Caro: acme_chief: add log_level to the config [software/acme-chief] - 10https://gerrit.wikimedia.org/r/788294 (https://phabricator.wikimedia.org/T307333) (owner: 10David Caro) [10:46:54] (03CR) 10Volans: [C: 03+1] "Sure, that seems a good fix. I'm wondering why the puppet run was skipped, was it manually fixed? 
Ideally every host should run their firs" [cookbooks] - 10https://gerrit.wikimedia.org/r/788308 (owner: 10Elukey) [10:47:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet [10:47:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:57] !log klausman@deploy1002 Started deploy [ores/deploy@98a1b2e]: (no justification provided) [10:48:58] (03PS1) 10Vgutierrez: acme_chief::server: Enable monitoring for reload-acme-chief-backend timer [puppet] - 10https://gerrit.wikimedia.org/r/788310 [10:49:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:02] !log klausman@deploy1002 Finished deploy [ores/deploy@98a1b2e]: (no justification provided) (duration: 00m 05s) [10:49:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:21] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35026/console" [puppet] - 10https://gerrit.wikimedia.org/r/788310 (owner: 10Vgutierrez) [10:51:55] (03CR) 10Klausman: sre.hosts.reimage: check if the first puppet output is not None (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/788308 (owner: 10Elukey) [10:56:44] (03PS2) 10Vgutierrez: acme_chief::server: Enable monitoring for reload-acme-chief-backend timer [puppet] - 10https://gerrit.wikimedia.org/r/788310 [10:57:07] !log elukey@deploy1002 Started deploy [ores/deploy@98a1b2e]: (no justification provided) [10:57:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:13] !log elukey@deploy1002 Finished deploy [ores/deploy@98a1b2e]: (no justification provided) (duration: 00m 06s) [10:57:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:28] !log elukey@deploy1002 Started deploy [ores/deploy@98a1b2e]: (no justification provided) [10:57:30] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:08] !log elukey@deploy1002 Finished deploy [ores/deploy@98a1b2e]: (no justification provided) (duration: 00m 40s) [10:58:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:54] (PS2) Majavah: P:toolforge::prometheus: add toolsbeta support [puppet] - https://gerrit.wikimedia.org/r/788305 (https://phabricator.wikimedia.org/T304716) [10:58:56] (PS1) David Caro: acme_chief::server: remove sre-traffic email from timer [puppet] - https://gerrit.wikimedia.org/r/788312 [10:59:09] !log elukey@deploy1002 Started deploy [ores/deploy@98a1b2e]: (no justification provided) [10:59:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:55] (CR) Vgutierrez: "This effectively sends emails to the root@ account, I think it should be mentioned on the commit message" [puppet] - https://gerrit.wikimedia.org/r/788312 (owner: David Caro) [11:00:32] (CR) Vgutierrez: [C: +2] acme_chief::server: Enable monitoring for reload-acme-chief-backend timer [puppet] - https://gerrit.wikimedia.org/r/788310 (owner: Vgutierrez) [11:00:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T298563)', diff saved to https://phabricator.wikimedia.org/P27343 and previous config saved to /var/cache/conftool/dbconfig/20220502-110033-ladsgroup.json [11:00:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1168.eqiad.wmnet with reason: Maintenance [11:00:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1168.eqiad.wmnet with reason: Maintenance [11:00:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:38] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [11:00:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:00:41] !log elukey@deploy1002 Started deploy [ores/deploy@98a1b2e]: (no justification provided) [11:00:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T298563)', diff saved to https://phabricator.wikimedia.org/P27344 and previous config saved to /var/cache/conftool/dbconfig/20220502-110041-ladsgroup.json [11:00:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:00] !log elukey@deploy1002 Finished deploy [ores/deploy@98a1b2e]: (no justification provided) (duration: 00m 19s) [11:01:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:15] (03CR) 10jerkins-bot: [V: 04-1] P:toolforge::prometheus: add toolsbeta support [puppet] - 10https://gerrit.wikimedia.org/r/788305 (https://phabricator.wikimedia.org/T304716) (owner: 10Majavah) [11:01:38] (03CR) 10Elukey: [C: 03+2] sre.hosts.reimage: check if the first puppet output is not None [cookbooks] - 10https://gerrit.wikimedia.org/r/788308 (owner: 10Elukey) [11:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [11:04:41] !log elukey@deploy1002 Started deploy [ores/deploy@98a1b2e]: (no justification provided) [11:04:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:46] !log elukey@deploy1002 Finished deploy [ores/deploy@98a1b2e]: (no justification provided) (duration: 00m 05s) [11:04:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:27] !log elukey@deploy1002 Started deploy [ores/deploy@98a1b2e]: (no justification provided) [11:06:30] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:12] !log elukey@deploy1002 Finished deploy [ores/deploy@98a1b2e]: (no justification provided) (duration: 00m 45s) [11:07:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:49] !log rolling upgrade of HAProxy in ulsfo [11:11:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:02] jouncebot: next [11:15:02] In 1 hour(s) and 44 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220502T1300) [11:25:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T298563)', diff saved to https://phabricator.wikimedia.org/P27345 and previous config saved to /var/cache/conftool/dbconfig/20220502-112502-ladsgroup.json [11:25:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:08] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [11:33:05] 10SRE, 10Release-Engineering-Team: Accidental removal of some files under /srv/deployment on deploy1002 - https://phabricator.wikimedia.org/T307349 (10elukey) p:05Triage→03High [11:40:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P27346 and previous config saved to /var/cache/conftool/dbconfig/20220502-114007-ladsgroup.json [11:40:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:31] 10SRE, 10Release-Engineering-Team: Accidental removal of some files under /srv/deployment on deploy1002 - https://phabricator.wikimedia.org/T307349 (10elukey) As suggested in the chat, I have created `/var/lock/scap-global-lock` with `Please check https://phabricator.wikimedia.org/T307349` [11:48:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install 
cloudcontrol100[6-7].wikimedia.org - https://phabricator.wikimedia.org/T306853 (10ayounsi) As services, infra and best practices changes over time (and through the 5 years servers lifetime) it's possible th... [11:49:39] 10SRE, 10Release-Engineering-Team: Accidental removal of some files under /srv/deployment on deploy1002 - https://phabricator.wikimedia.org/T307349 (10elukey) A backup is being placed under /srv/restore/srv/deployment on deploy1002 by Jaime. The last backup was taken today at 04:13 UTC. The time of the deleti... [11:50:39] 10SRE, 10Release-Engineering-Team, 10bacula: Accidental removal of some files under /srv/deployment on deploy1002 - https://phabricator.wikimedia.org/T307349 (10jcrespo) [11:50:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [11:55:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P27347 and previous config saved to /var/cache/conftool/dbconfig/20220502-115513-ladsgroup.json [11:55:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:03] (03CR) 10Ayounsi: "A few post merge comments" [homer/public] - 10https://gerrit.wikimedia.org/r/786296 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [12:03:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudcontrol100[6-7].wikimedia.org - https://phabricator.wikimedia.org/T306853 (10Majavah) There are three main flows that currently utilize the public IPs of cloudcontrols (unless I'm missing something): 1... 
[12:05:42] SRE, Release-Engineering-Team, bacula: Accidental removal of some files under /srv/deployment on deploy1002 - https://phabricator.wikimedia.org/T307349 (elukey) The last error that puppet highlights is: ` Error: Execution of '/usr/bin/scap deploy --init' returned 1: Error: /Stage[main]/Profile::Med... [12:10:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T298563)', diff saved to https://phabricator.wikimedia.org/P27349 and previous config saved to /var/cache/conftool/dbconfig/20220502-121018-ladsgroup.json [12:10:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:23] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [12:10:45] SRE, Release-Engineering-Team, bacula: Accidental removal of some files under /srv/deployment on deploy1002 - https://phabricator.wikimedia.org/T307349 (jcrespo) Bacula recovery log for the record: {P27348} [12:17:05] SRE, Deployments, bacula, Release-Engineering-Team (Doing): Accidental removal of some files under /srv/deployment on deploy1002 - https://phabricator.wikimedia.org/T307349 (hashar) p:High→Unbreak! This is 100% an unbreak now. [12:19:14] jouncebot: next [12:19:14] In 0 hour(s) and 40 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220502T1300) [12:19:16] jouncebot: now [12:19:16] No deployments scheduled for the next 0 hour(s) and 40 minute(s) [12:25:32] SRE, Deployments, bacula, Release-Engineering-Team (Doing): Accidental removal of some files under /srv/deployment on deploy1002 - https://phabricator.wikimedia.org/T307349 (jcrespo) Restore finished ok: ` 02-May 12:22 backup1001.eqiad.wmnet-fd JobId 437308: Elapsed time=00:37:57, Transfer rate=2...
[12:26:41] SRE, Deployments, bacula, Release-Engineering-Team (Doing): Accidental removal of some files under /srv/deployment on deploy1002 - https://phabricator.wikimedia.org/T307349 (hashar) [12:34:33] (PS3) Kormat: fix_logging.log_timestamp_type_T298555.py: New schema change. [software/schema-changes] - https://gerrit.wikimedia.org/r/788291 (https://phabricator.wikimedia.org/T298555) [12:35:11] (CR) Kormat: fix_logging.log_timestamp_type_T298555.py: New schema change. (2 comments) [software/schema-changes] - https://gerrit.wikimedia.org/r/788291 (https://phabricator.wikimedia.org/T298555) (owner: Kormat) [12:35:42] (PS2) Kormat: fix_revision.rev_timestamp_type_T298560.py: New schema change. [software/schema-changes] - https://gerrit.wikimedia.org/r/788290 (https://phabricator.wikimedia.org/T298560) [12:36:12] SRE, Deployments, bacula, Release-Engineering-Team (Doing): Accidental removal of some files under /srv/deployment on deploy1002 - https://phabricator.wikimedia.org/T307349 (hashar) I checked the HEAD of all git repos under `/srv/deployment` with: find /srv/deployment -name .git -print0 -ex... [12:36:59] (PS1) Kormat: Skip first line of output from `db.run_sql` [software/schema-changes] - https://gerrit.wikimedia.org/r/788325 [12:37:34] (CR) Kormat: fix_logging.log_timestamp_type_T298555.py: New schema change.
(031 comment) [software/schema-changes] - 10https://gerrit.wikimedia.org/r/788291 (https://phabricator.wikimedia.org/T298555) (owner: 10Kormat) [12:38:10] (03PS1) 10Filippo Giunchedi: team-sre: introduce paging probe down [alerts] - 10https://gerrit.wikimedia.org/r/788346 (https://phabricator.wikimedia.org/T291946) [12:40:22] (03CR) 10jerkins-bot: [V: 04-1] team-sre: introduce paging probe down [alerts] - 10https://gerrit.wikimedia.org/r/788346 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [12:41:18] (03PS2) 10Filippo Giunchedi: team-sre: introduce paging probe down [alerts] - 10https://gerrit.wikimedia.org/r/788346 (https://phabricator.wikimedia.org/T291946) [12:45:11] !log dbmaint Deploying schema change to s2 (T303603) [12:45:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:17] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [12:48:01] !log swapped /srv/deployment directory on deploy1002 with the one from the latest backup - T307349 [12:48:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:06] T307349: Accidental removal of some files under /srv/deployment on deploy1002 - https://phabricator.wikimedia.org/T307349 [12:51:52] 10SRE, 10Deployments, 10bacula, 10Release-Engineering-Team (Doing): Accidental removal of some files under /srv/deployment on deploy1002 - https://phabricator.wikimedia.org/T307349 (10hashar) After restore: ` $ colordiff -U0 --text deploy2002 deploy1002 --- deploy2002 2022-05-02 14:51:09.946189381 +0200 +... 
[12:52:39] 10SRE, 10Scap: Add new user identity to Keyholder - https://phabricator.wikimedia.org/T307351 (10jnuche) [12:55:32] !log dbmaint Deploying schema change to s2@codfw (T303603) [12:55:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:36] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [12:57:45] !log rolling upgrade of HAProxy in drmrs [12:57:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:12] 10SRE, 10Deployments, 10bacula, 10Release-Engineering-Team (Doing): Accidental removal of some files under /srv/deployment on deploy1002 - https://phabricator.wikimedia.org/T307349 (10Volans) >>! In T307349#7895662, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), href=http... [12:58:40] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 9 hosts with reason: Deploying schema change to s2@codfw T303603 [12:58:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:47] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 9 hosts with reason: Deploying schema change to s2@codfw T303603 [12:58:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:05] RoanKattouw, Lucas_WMDE, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220502T1300). [13:00:05] kart_, awight, and nemo-yiannis: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:17] * kart_ is here [13:00:30] I don’t think we can deploy yet, right? 
[13:00:32] hey [13:00:47] * urbanecm waves [13:00:51] T307349 still ongoing, I believe [13:00:51] T307349: Accidental removal of some files under /srv/deployment on deploy1002 - https://phabricator.wikimedia.org/T307349 [13:00:52] Lucas_WMDE: not yet, correct [13:00:55] ack [13:01:06] Ouch :/ [13:01:06] but we should be close to the full resolution [13:01:16] great! [13:01:17] Let me reschedule my config change later.. [13:01:17] Hi folks! [13:01:27] kart_: sounds like we might be able to do it later in the window [13:01:29] this is my bad, but the restored versions are already in place [13:01:42] we are checking that all is good, apologies for the problems [13:02:28] Lucas_WMDE: OK. I'll be here :) [13:03:41] kart_, Lucas_WMDE - if you have repos under /srv/deployment on deploy1002 and you want to check that everything is good [13:03:48] (if you have time I mean) [13:03:56] I don’t think I have anything there [13:04:11] (03CR) 10Awight: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788347 (https://phabricator.wikimedia.org/T300712) (owner: 10Awight) [13:04:18] I think the only dirs I’ve dealt with are mediawiki-staging and deployment-charts [13:04:53] basically we are trying to confirm the restore worked ok [13:05:02] before allowing to continue new deployments [13:05:25] (03CR) 10WMDE-Fisch: [C: 03+1] Enable versioned maps everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788347 (https://phabricator.wikimedia.org/T300712) (owner: 10Awight) [13:05:55] (03PS2) 10Awight: Enable versioned maps everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788347 (https://phabricator.wikimedia.org/T300712) [13:06:43] (03CR) 10Ottomata: [C: 03+1] Image Suggestions Feedback Stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787749 (owner: 10Luke Bowmaker) [13:07:29] /srv/deployment/cxserver [13:07:46] is deprecated. 
Oh IRC took it as command :) [13:07:59] ^^ [13:09:00] most clients let you type something like //etc/foo that will result in [13:09:02] /etc/foo [13:09:13] nemo-yiannis: ok to start removing files from 'tegola-swift-container' and eventually the whole container/bucket? [13:10:17] godog: by any chance, do you do anything with scap3 on deployment hosts? [13:10:30] jynus: I do not, no [13:10:52] jynus: not on a regular basis anyways now that I think of it, I deploy librenms from time to time for upgrades [13:11:19] that helps [13:11:31] could you log in and check it looks ok on deploy1002? [13:11:36] we just did a data recovery [13:11:37] godog: i just double checked that we don't use it in our current setup, i think it's okto remove it [13:11:45] nemo-yiannis: ack, thanks! [13:11:46] *ok to [13:11:47] jynus: will do [13:11:59] taavi: noted. Thanks :) [13:12:20] jynus: LGTM [13:13:19] !log start removal of 'tegola-swift-container' and its objects - T307184 [13:13:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:24] T307184: Followups for Tegola and Swift interactions - https://phabricator.wikimedia.org/T307184 [13:13:36] (CR) Awight: [C: -2] "DNM until we've announced in Tech News" [mediawiki-config] - https://gerrit.wikimedia.org/r/788347 (https://phabricator.wikimedia.org/T300712) (owner: Awight) [13:18:18] SRE, Deployments, bacula, Release-Engineering-Team (Doing): Accidental removal of some files under /srv/deployment on deploy1002 - https://phabricator.wikimedia.org/T307349 (hashar) Ran the git rev-parse HEAD again: ` $ diff --text -U0 deploy2002 deploy1002 --- deploy2002 2022-05-02 15:17:27.8861... 
[13:21:46] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] ssl: Add dummy key for toolsbeta k8s prometheus [labs/private] - 10https://gerrit.wikimedia.org/r/788303 (https://phabricator.wikimedia.org/T304716) (owner: 10Majavah) [13:22:31] 10SRE, 10Deployments, 10bacula, 10Release-Engineering-Team (Doing): Accidental removal of some files under /srv/deployment on deploy1002 - https://phabricator.wikimedia.org/T307349 (10elukey) p:05Unbreak!→03Medium Everything got restored, and /var/lock/scap-global-lock has been removed, deployments can... [13:22:51] (03PS2) 10Majavah: P:toolforge::prometheus: remove absented resources [puppet] - 10https://gerrit.wikimedia.org/r/788304 [13:22:53] (03PS3) 10Majavah: P:toolforge::prometheus: add toolsbeta support [puppet] - 10https://gerrit.wikimedia.org/r/788305 (https://phabricator.wikimedia.org/T304716) [13:23:22] (03CR) 10Andrew Bogott: [C: 03+2] prometheus-openstack-exporter: only run it on the primary server [puppet] - 10https://gerrit.wikimedia.org/r/779516 (https://phabricator.wikimedia.org/T302178) (owner: 10Arturo Borrero Gonzalez) [13:24:37] (03CR) 10Andrew Bogott: [C: 03+1] Update eqiad1 cloudservices openstack [puppet] - 10https://gerrit.wikimedia.org/r/774523 (https://phabricator.wikimedia.org/T304880) (owner: 10Vivian Rook) [13:24:53] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35027/console" [puppet] - 10https://gerrit.wikimedia.org/r/788305 (https://phabricator.wikimedia.org/T304716) (owner: 10Majavah) [13:25:09] (03CR) 10Vivian Rook: [C: 03+2] Update eqiad1 cloudservices openstack [puppet] - 10https://gerrit.wikimedia.org/r/774523 (https://phabricator.wikimedia.org/T304880) (owner: 10Vivian Rook) [13:28:26] (03PS2) 10David Caro: acme_chief::server: remove sre-traffic email from timer [puppet] - 10https://gerrit.wikimedia.org/r/788312 [13:33:47] (fine to skip my config patch for today) [13:34:02] still doing the final checks :) [13:34:19] 
(03CR) 10Winston Sung: "This change is ready for review." [skins/Vector] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/788329 (https://phabricator.wikimedia.org/T307298) (owner: 10Winston Sung) [13:35:29] no rush! [13:36:00] (03CR) 10Winston Sung: "recheck" [skins/Vector] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/788329 (https://phabricator.wikimedia.org/T307298) (owner: 10Winston Sung) [13:38:44] 10SRE, 10Deployments, 10bacula, 10Release-Engineering-Team (Doing): Accidental removal of some files under /srv/deployment on deploy1002 - https://phabricator.wikimedia.org/T307349 (10hashar) From IRC the big diff at T307349#7895665 shows that deploy2002 has repositories with a more up to date git commit.... [13:39:29] (03PS3) 10Winston Sung: Localisation updates from https://translatewiki.net. [skins/Vector] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/788329 (https://phabricator.wikimedia.org/T307298) [13:40:03] (03PS4) 10Winston Sung: Localisation updates from https://translatewiki.net. [skins/Vector] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/788329 (https://phabricator.wikimedia.org/T307298) [13:41:36] !log dcaro@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudbackup2001.codfw.wmnet [13:41:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:48] Would it be okay to do this kind of localisation updates to fix an issue on zhwiki? 
[13:41:48] https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/788329 [13:42:00] * localisation backport [13:42:23] SRE, SRE-tools, DNS, Infrastructure-Foundations, and 2 others: sre.dns.netbox cookbook dosn't support period terminated domains - https://phabricator.wikimedia.org/T306809 (ayounsi) [13:43:32] (PS4) David Caro: Move from deprecated icinga_hosts to alerting_hosts [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/786960 (https://phabricator.wikimedia.org/T304533) [13:43:34] (PS1) David Caro: wmcs: Fix types and associated code refactor [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/788351 [13:44:59] Winston_Sung[m]: I don’t think we’ll have time for a full scap in the rest of the ongoing deployment window [13:45:03] jouncebot: next [13:45:03] In 1 hour(s) and 44 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220502T1530) [13:45:52] Lucas_WMDE: we don't have time for a config change either, right? [13:46:09] I think we might, at least for one change [13:46:14] (CR) Andrew Bogott: [C: +2] Update grants to reflect replacement of cloudweb2001 with cloudweb2002 [puppet] - https://gerrit.wikimedia.org/r/785395 (owner: Andrew Bogott) [13:46:41] hello folks, deployments unblocked [13:46:53] please be extra careful when checking commits etc. before the deploy [13:46:56] Nice recovery! [13:47:02] (CR) jerkins-bot: [V: -1] wmcs: Fix types and associated code refactor [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/788351 (owner: David Caro) [13:47:16] we restored a backup from this morning UTC time and all should be good, we checked a lot of things but please be vigilant today :) [13:47:19] alright, thanks elukey! 
[13:47:25] apologies again for the issue :( [13:48:01] (03CR) 10Ayounsi: [C: 03+2] replace_device: actually save the cable modification [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/785272 (https://phabricator.wikimedia.org/T259166) (owner: 10Ayounsi) [13:48:04] I think we can do kart_’s testwiki change then [13:48:11] !log elukey@deploy1002 Started deploy [ores/deploy@98a1b2e]: (no justification provided) [13:48:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:24] Lucas_WMDE: cool. [13:48:30] !log elukey@deploy1002 Finished deploy [ores/deploy@98a1b2e]: (no justification provided) (duration: 00m 18s) [13:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:40] (03Merged) 10jenkins-bot: replace_device: actually save the cable modification [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/785272 (https://phabricator.wikimedia.org/T259166) (owner: 10Ayounsi) [13:49:10] (03PS4) 10Lucas Werkmeister (WMDE): Enable SectionTranslation in testwiki for af, as, gu, kn, mk and sr [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787868 (https://phabricator.wikimedia.org/T304828) (owner: 10KartikMistry) [13:49:44] !log rolling upgrade of HAProxy in codfw [13:49:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:49] elukey but now you get your obligatory "The time I rm -rf'd in production" story! :) [13:49:55] Everyone has to have one. [13:50:05] Moar T-shirts! [13:50:33] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Enable SectionTranslation in testwiki for af, as, gu, kn, mk and sr [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787868 (https://phabricator.wikimedia.org/T304828) (owner: 10KartikMistry) [13:50:37] perryprog: :) [13:50:56] t-shirt: “I rm -rf’d up”? 
[13:50:57] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host aqs2004.codfw.wmnet with OS bullseye [13:50:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:02] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host aqs2004.codfw.wmnet with OS bullseye [13:51:16] (03Merged) 10jenkins-bot: Enable SectionTranslation in testwiki for af, as, gu, kn, mk and sr [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787868 (https://phabricator.wikimedia.org/T304828) (owner: 10KartikMistry) [13:51:40] kart_: the change should be on mwdebug1001, can you test it? [13:51:45] 10SRE, 10Deployments, 10Parsoid, 10bacula, 10Release-Engineering-Team (Doing): Accidental removal of some files under /srv/deployment on deploy1002 - https://phabricator.wikimedia.org/T307349 (10jcrespo) Adding parsoid team here for awareness- please check the repo and all its submodules look as expected... [13:52:03] or a happy couple of years I had an official "I broke Wikipedia (but then fixed it)" shirt. [13:52:07] *For [13:52:53] Lucas_WMDE: testing.. [13:52:58] ok thanks [13:52:59] !log rook@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 6 hosts with reason: upgrading openstack [13:53:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:04] !log rook@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: upgrading openstack [13:53:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:21] !log rook@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 118 hosts with reason: upgrading openstack [13:53:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:42] Lucas_WMDE: looks good! [13:53:46] ok! 
[13:54:31] hm, I’m not seeing anything in the mwdebug logstash o_O [13:54:37] !log rook@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 118 hosts with reason: upgrading openstack [13:54:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:14] but I see something after going there myself (on wikidata) [13:55:33] maybe it just didn’t generate any log messages [13:55:49] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs2004.codfw.wmnet with reason: host reimage [13:55:50] syncing [13:55:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:55] (03PS2) 10David Caro: wmcs: Fix types and associated code refactor [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/788351 [13:56:36] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:787868|Enable SectionTranslation in testwiki for af, as, gu, kn, mk and sr (T304828, T304858)]] (duration: 00m 49s) [13:56:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:41] T304828: Enable Section Translation in 13 wikis where Content Translation is already available as default - https://phabricator.wikimedia.org/T304828 [13:56:42] T304858: Enable Content and Section Translation for Serbian Wikipedia - https://phabricator.wikimedia.org/T304858 [13:56:55] Lucas_WMDE: Thanks! 
[13:57:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:57:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:48] I think I’ll leave it there, unless someone else wants to deploy other things from this window [13:58:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:58:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:58:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:43] (CR) jerkins-bot: [V: -1] wmcs: Fix types and associated code refactor [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/788351 (owner: David Caro) [13:58:47] (CR) Gergő Tisza: Newcomer tasks: deploy AND topic selection to pilot wikis (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/780874 (https://phabricator.wikimedia.org/T305399) (owner: Sergio Gimeno) [13:59:11] !log UTC afternoon backport+config window done [13:59:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:59:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:19] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs2004.codfw.wmnet with reason: host reimage [13:59:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:29] +1 no need to push our luck. 
[13:59:43] sorry to the other three, please reschedule your deployments… [14:01:42] (03PS21) 10Bking: Elastic: test puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) [14:02:18] (03CR) 10jerkins-bot: [V: 04-1] Elastic: test puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [14:05:04] !log dcaro@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host cloudbackup2001.codfw.wmnet [14:05:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:34] 10SRE, 10ops-codfw, 10DC-Ops: mw2286 stuck after reboot - https://phabricator.wikimedia.org/T306823 (10Papaul) Mon Apr 25 2022 15:42:32 Multi-bit memory errors detected on a memory device at location(s) DIMM_A1. Mon Apr 25 2022 15:42:32 A problem was detected in Memory Reference Code (MRC). Mon Apr... [14:07:11] (03PS22) 10Bking: Elastic: test puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) [14:07:45] (03CR) 10jerkins-bot: [V: 04-1] Elastic: test puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [14:08:35] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:08:36] 10SRE, 10Generated Data Platform, 10Image-Suggestions, 10serviceops, and 3 others: New Service Request Generated Datasets: Image Suggestions Service - https://phabricator.wikimedia.org/T304891 (10WDoranWMF) [14:08:49] sukhe: ^^ [14:09:27] yeah, I put in a fix yesterday as well but clearly [14:10:38] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs2004.codfw.wmnet with OS bullseye [14:10:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[14:10:42] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host aqs2004.codfw.wmnet with OS bullseye completed: - aqs2004 (**PASS**)... [14:11:03] !log dcaro@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudbackup2001.codfw.wmnet [14:11:04] !log dcaro@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host cloudbackup2001.codfw.wmnet [14:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:25] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:13:48] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ores2002.codfw.wmnet with OS buster [14:13:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:15] (03PS23) 10Bking: Elastic: test puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) [14:18:24] (03PS1) 10Lucas Werkmeister (WMDE): Use "unexpectedUnconnectedPage" page prop everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788356 [14:23:13] (03PS19) 10Herron: prometheus: enable prometheus web access via proxy with IDP [puppet] - 10https://gerrit.wikimedia.org/r/764895 (https://phabricator.wikimedia.org/T301944) [14:25:42] (03CR) 10Ayounsi: [C: 03+2] Prevent re-using network ports when provisioning hosts in Netbox [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/780517 (https://phabricator.wikimedia.org/T272068) (owner: 10Ayounsi) [14:26:18] (03Merged) 10jenkins-bot: Prevent re-using network ports when provisioning hosts in Netbox [software/netbox-extras] - 
10https://gerrit.wikimedia.org/r/780517 (https://phabricator.wikimedia.org/T272068) (owner: 10Ayounsi) [14:26:56] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review: Prevent re-using network ports when provisioning hosts in Netbox - https://phabricator.wikimedia.org/T272068 (10ayounsi) 05Open→03Resolved [14:27:08] (03PS2) 10Ayounsi: Remove support for legacy ELS junos syntax [homer/public] - 10https://gerrit.wikimedia.org/r/785273 [14:28:09] (03CR) 10Ayounsi: [C: 03+2] Remove support for legacy ELS junos syntax [homer/public] - 10https://gerrit.wikimedia.org/r/785273 (owner: 10Ayounsi) [14:28:40] (03Merged) 10jenkins-bot: Remove support for legacy ELS junos syntax [homer/public] - 10https://gerrit.wikimedia.org/r/785273 (owner: 10Ayounsi) [14:29:52] (03PS1) 10Vivian Rook: upgrade openstack to wallaby [puppet] - 10https://gerrit.wikimedia.org/r/788359 (https://phabricator.wikimedia.org/T281275) [14:31:07] (03PS24) 10Bking: Elastic: test puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) [14:31:29] (03CR) 10Andrew Bogott: [C: 03+1] upgrade openstack to wallaby [puppet] - 10https://gerrit.wikimedia.org/r/788359 (https://phabricator.wikimedia.org/T281275) (owner: 10Vivian Rook) [14:31:37] (03CR) 10Vivian Rook: [C: 03+2] upgrade openstack to wallaby [puppet] - 10https://gerrit.wikimedia.org/r/788359 (https://phabricator.wikimedia.org/T281275) (owner: 10Vivian Rook) [14:33:20] (03CR) 10jerkins-bot: [V: 04-1] Elastic: test puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [14:34:09] (03PS20) 10Herron: prometheus: enable prometheus web access via proxy with IDP [puppet] - 10https://gerrit.wikimedia.org/r/764895 (https://phabricator.wikimedia.org/T301944) [14:36:36] (03PS25) 10Bking: Elastic: test puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) [14:37:31] !log 
elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ores2002.codfw.wmnet with reason: host reimage [14:37:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:38:29] (03CR) 10jerkins-bot: [V: 04-1] Elastic: test puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [14:40:30] (03PS26) 10Bking: Elastic: test puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) [14:40:54] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ores2002.codfw.wmnet with reason: host reimage [14:40:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:41] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [14:47:09] (03PS27) 10Bking: Elastic: test puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) [14:50:19] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [14:50:31] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host aqs2005.codfw.wmnet with OS bullseye [14:50:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:36] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host 
aqs2005.codfw.wmnet with OS bullseye [14:50:43] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host aqs2005.codfw.wmnet with OS bullseye [14:50:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:48] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host aqs2005.codfw.wmnet with OS bullseye executed with errors: - aqs2005 (... [14:55:00] PROBLEM - Check systemd state on an-airflow1002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:56:44] oops, that was me, fixing [14:57:02] (03PS28) 10Bking: Elastic: test puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) [14:58:32] RECOVERY - Check systemd state on an-airflow1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:58:50] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host aqs2005.codfw.wmnet with OS bullseye [14:58:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:53] (03CR) 10jerkins-bot: [V: 04-1] Elastic: test puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [14:58:55] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host aqs2005.codfw.wmnet with OS bullseye [15:01:24] PROBLEM - SSH on wtp1048.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:06:14] (03PS2) 10Herron: Split watchrat URLs by need of proxy usage [puppet] - 10https://gerrit.wikimedia.org/r/776878 (https://phabricator.wikimedia.org/T303803) (owner: 10Alexandros Kosiaris) [15:09:10] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host aqs2006.mgmt.codfw.wmnet with reboot policy FORCED [15:09:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:24] (03PS3) 10Herron: Split watchrat URLs by need of proxy usage [puppet] - 10https://gerrit.wikimedia.org/r/776878 (https://phabricator.wikimedia.org/T303803) (owner: 10Alexandros Kosiaris) [15:12:37] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [15:13:14] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ores2002.codfw.wmnet with OS buster [15:13:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:41] (03CR) 10Herron: [C: 03+2] "Thanks for putting this together! 
Made a small amendment to use an 'http_connect_23xx' module so the proxied and non-proxied variants are" [puppet] - 10https://gerrit.wikimedia.org/r/776878 (https://phabricator.wikimedia.org/T303803) (owner: 10Alexandros Kosiaris) [15:18:21] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [15:18:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:01] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:22:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:54] Would an operator kindly set me as the person in the topic who is on clinic duty? [15:44:59] (03CR) 10Dzahn: "ok, great :)" [puppet] - 10https://gerrit.wikimedia.org/r/786984 (https://phabricator.wikimedia.org/T303970) (owner: 10Hashar) [15:45:01] (03PS1) 10Andrew Bogott: Disable EventLogging extension [wikitech-static] - 10https://gerrit.wikimedia.org/r/788370 [15:45:40] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:40] (03CR) 10Andrew Bogott: "This is already hotfixed on wikitech-static but checking in to make sure that's not somehow terrible." [wikitech-static] - 10https://gerrit.wikimedia.org/r/788370 (owner: 10Andrew Bogott) [15:46:32] RECOVERY - Host mw2286 is UP: PING OK - Packet loss = 0%, RTA = 33.22 ms [15:46:38] PROBLEM - puppet last run on mw2286 is CRITICAL: CRITICAL: Puppet last ran 7 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [15:47:08] RECOVERY - Host mw2286.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.56 ms [15:47:09] !log rolling upgrade of HAProxy in eqsin [15:47:11] !log upgrading wikitech-static to REL1_38. 
this includes a hotfix of https://gerrit.wikimedia.org/r/c/operations/wikitech-static/+/788370 [15:47:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:15] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs2005.codfw.wmnet with reason: host reimage [15:47:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:41] (PS3) David Caro: wmcs: Fix types and associated code refactor [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/788351 [15:50:44] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs2005.codfw.wmnet with reason: host reimage [15:50:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:51] SRE, ops-codfw, DC-Ops: mw2286 stuck after reboot - https://phabricator.wikimedia.org/T306823 (Papaul) Open→Resolved a:Papaul Replaced DIMM-A1 server is back up. [15:51:40] RECOVERY - puppet last run on mw2286 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [15:52:39] (CR) Ladsgroup: "This would break in dry runs. We can say whatever and that's acceptable to me but want to flag it." [software/schema-changes] - https://gerrit.wikimedia.org/r/788325 (owner: Kormat) [15:53:09] (CR) Ladsgroup: fix_logging.log_timestamp_type_T298555.py: New schema change. 
(1 comment) [software/schema-changes] - https://gerrit.wikimedia.org/r/788291 (https://phabricator.wikimedia.org/T298555) (owner: Kormat) [15:54:15] (CR) jerkins-bot: [V: -1] wmcs: Fix types and associated code refactor [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/788351 (owner: David Caro) [15:55:40] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:56:26] !log elukey@deploy1002 Started deploy [ores/deploy@98a1b2e]: (no justification provided) [15:56:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:53] !log elukey@deploy1002 Finished deploy [ores/deploy@98a1b2e]: (no justification provided) (duration: 00m 27s) [15:56:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:32] PROBLEM - Check systemd state on cloudelastic1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:57:45] ops-codfw: decommission labstore-sparearray2001 - https://phabricator.wikimedia.org/T307370 (Papaul) [15:58:12] PROBLEM - mediawiki-installation DSH group on mw2286 is CRITICAL: Host mw2286 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:58:28] andrewbogott: fyi, the issues on labweb (due to etcd cert) (that jobs timer) - were all resolved [16:02:05] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs2005.codfw.wmnet with OS bullseye [16:02:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:10] SRE, ops-codfw, Cassandra, DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - 
https://phabricator.wikimedia.org/T305568 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host aqs2005.codfw.wmnet with OS bullseye completed: - aqs2005 (**PASS**)... [16:02:14] !log mw2286: scap pull and recheck icinga checks. Server came up after hardware failure [16:02:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:10] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host aqs2007.codfw.wmnet with OS bullseye [16:03:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:16] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host aqs2007.codfw.wmnet with OS bullseye [16:04:55] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aqs2006.mgmt.codfw.wmnet with reboot policy FORCED [16:04:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:24] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host aqs2006.codfw.wmnet with OS bullseye [16:05:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:34] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host aqs2006.codfw.wmnet with OS bullseye [16:06:48] !log jelto@cumin1001 conftool action : set/pooled=no; selector: name=mw2286.codfw.wmnet [16:06:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:31] mutante: good to know, thanks [16:12:38] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:14:22] (03PS1) 10Andrew Bogott: site.pp: remove cloudvirt101[2,3,4,5].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/788375 (https://phabricator.wikimedia.org/T260840) [16:16:38] 10ops-codfw: decommission labstore-sparearray2001 - https://phabricator.wikimedia.org/T307370 (10Papaul) [16:16:47] 10ops-codfw: decommission labstore-sparearray2001 - https://phabricator.wikimedia.org/T307370 (10Papaul) 05Open→03Resolved complete [16:16:55] (03PS2) 10Andrew Bogott: site.pp: remove cloudvirt101[2,3,4,5].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/788375 (https://phabricator.wikimedia.org/T260840) [16:17:07] (03CR) 10Vivian Rook: [C: 03+1] site.pp: remove cloudvirt101[2,3,4,5].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/788375 (https://phabricator.wikimedia.org/T260840) (owner: 10Andrew Bogott) [16:17:39] (03CR) 10Andrew Bogott: [C: 03+2] site.pp: remove cloudvirt101[2,3,4,5].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/788375 (https://phabricator.wikimedia.org/T260840) (owner: 10Andrew Bogott) [16:18:28] RECOVERY - mediawiki-installation DSH group on mw2286 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:28:55] (03PS4) 10David Caro: wmcs: Fix types and associated code refactor [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/788351 [16:31:46] 10SRE, 10Dumps-Generation, 10Infrastructure-Foundations (FY2021/2022-Q4), 10SRE Observability (FY2021/2022-Q4), 10Security: Remaining data engineering host security restarts - https://phabricator.wikimedia.org/T307055 (10razzi) I sent out an email that all the named hosts, other than an-airflow, will be... 
[16:34:52] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host aqs2007.codfw.wmnet with OS bullseye [16:34:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:57] SRE, ops-codfw, Cassandra, DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host aqs2007.codfw.wmnet with OS bullseye executed with errors: - aqs2007 (... [16:35:27] (CR) Reedy: "EventLogging probably isn't needed on wikitech-static, but the error is coming from `User::getIntOption` being removed in REL1_38, but bas" [wikitech-static] - https://gerrit.wikimedia.org/r/788370 (owner: Andrew Bogott) [16:37:18] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host aqs2006.codfw.wmnet with OS bullseye [16:37:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:23] SRE, ops-codfw, Cassandra, DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host aqs2006.codfw.wmnet with OS bullseye executed with errors: - aqs2006 (... 
[16:40:53] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host aqs2007.codfw.wmnet with OS bullseye [16:40:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:58] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host aqs2007.codfw.wmnet with OS bullseye [16:41:10] PROBLEM - Check systemd state on ores2002 is CRITICAL: CRITICAL - degraded: The following units failed: celery-ores-worker.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:41:38] wip node, downtiming --^ [16:41:40] PROBLEM - ores_workers_running on ores2002 is CRITICAL: PROCS CRITICAL: 0 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [16:42:12] (03CR) 10Andrew Bogott: Disable EventLogging extension (031 comment) [wikitech-static] - 10https://gerrit.wikimedia.org/r/788370 (owner: 10Andrew Bogott) [16:42:59] (03CR) 10Ottomata: [C: 03+1] Disable EventLogging extension [wikitech-static] - 10https://gerrit.wikimedia.org/r/788370 (owner: 10Andrew Bogott) [16:45:48] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs2007.codfw.wmnet with reason: host reimage [16:45:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:00] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:48:10] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host aqs2006.codfw.wmnet with OS bullseye [16:48:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:15] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host aqs2006.codfw.wmnet with OS bullseye [16:48:30] !log elukey@deploy1002 Started deploy [ores/deploy@98a1b2e]: (no justification provided) [16:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:39] !log elukey@deploy1002 Finished deploy [ores/deploy@98a1b2e]: (no justification provided) (duration: 00m 09s) [16:48:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:09] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs2007.codfw.wmnet with reason: host reimage [16:49:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:44] PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:53:05] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs2006.codfw.wmnet with reason: host reimage [16:53:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:14] !log dcaro@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudmetrics1001.eqiad.wmnet [16:56:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:28] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs2006.codfw.wmnet with reason: host reimage [16:56:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:04] ryankemper: Your horoscope predicts another unfortunate Wikidata Query Service weekly deploy deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220502T1700). 
[17:01:06] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs2007.codfw.wmnet with OS bullseye [17:01:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:20] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host aqs2007.codfw.wmnet with OS bullseye completed: - aqs2007 (**PASS**)... [17:01:46] RECOVERY - SSH on wtp1048.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:05:26] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudmetrics1001.eqiad.wmnet [17:05:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:12] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs2006.codfw.wmnet with OS bullseye [17:08:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:16] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host aqs2006.codfw.wmnet with OS bullseye completed: - aqs2006 (**PASS**)... [17:11:09] (03CR) 10Reedy: Disable EventLogging extension (031 comment) [wikitech-static] - 10https://gerrit.wikimedia.org/r/788370 (owner: 10Andrew Bogott) [17:14:20] (03CR) 10Nskaggs: "It's not clear to me what, if any, of the changes were made by a formatting tool. 
Can you clarify which changes were made utilizing a tool" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/788351 (owner: 10David Caro) [17:24:57] (03Abandoned) 10Gergő Tisza: Video landing page: Don't show campaign body text on mobile [extensions/GrowthExperiments] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/787459 (https://phabricator.wikimedia.org/T303785) (owner: 10Gergő Tisza) [17:27:50] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host aqs2008.codfw.wmnet with OS bullseye [17:27:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:57] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host aqs2008.codfw.wmnet with OS bullseye [17:29:16] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10serviceops, 10Release-Engineering-Team (Radar): Need a service account on deploy servers for automated train pre-sync operations - https://phabricator.wikimedia.org/T303857 (10dancy) @Joe Pinging on this ticket. Outstanding issues: * I defi... 
[17:30:17] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host aqs2009.codfw.wmnet with OS bullseye [17:30:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:26] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host aqs2009.codfw.wmnet with OS bullseye [17:31:22] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10Papaul) [17:33:34] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [17:35:06] RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 3 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [17:35:16] 10SRE, 10Wikimedia-Incident: Modernize etcd tlsproxy certificate management - https://phabricator.wikimedia.org/T307382 (10colewhite) [17:35:43] 10SRE, 10Wikimedia-Incident: Modernize etcd tlsproxy certificate management - https://phabricator.wikimedia.org/T307382 (10colewhite) [17:35:46] 10SRE, 10serviceops: Renew puppet cert for etcd.codfw.wmnet - https://phabricator.wikimedia.org/T302153 (10colewhite) [17:38:16] !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudvirt1018.eqiad.wmnet [17:38:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:26] (03PS1) 10Andrew Bogott: Remove ref to cloudvirt1018 [puppet] - 10https://gerrit.wikimedia.org/r/788381 (https://phabricator.wikimedia.org/T296790) [17:40:46] (03CR) 10Dzahn: "Is it really an improvement 
to mail everyone instead of the service owners?" [puppet] - 10https://gerrit.wikimedia.org/r/788312 (owner: 10David Caro) [17:41:32] (03CR) 10Dzahn: "ah, this change is about traffic getting mail from cloud projects? maybe it should be configured based on $realm then" [puppet] - 10https://gerrit.wikimedia.org/r/788312 (owner: 10David Caro) [17:42:06] !log andrew@cumin1001 START - Cookbook sre.dns.netbox [17:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:48] (03CR) 10Andrew Bogott: [C: 03+2] Remove ref to cloudvirt1018 [puppet] - 10https://gerrit.wikimedia.org/r/788381 (https://phabricator.wikimedia.org/T296790) (owner: 10Andrew Bogott) [17:42:50] (03CR) 10Dzahn: [C: 03+2] Planet: Update my (bawolff) blog url [puppet] - 10https://gerrit.wikimedia.org/r/787938 (owner: 10Brian Wolff) [17:44:59] !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:45:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:28] (03CR) 10Dzahn: "Reedy, since you are the creator of the linked task but once said to stall it. Are you for this?" 
[puppet] - 10https://gerrit.wikimedia.org/r/776349 (https://phabricator.wikimedia.org/T257473) (owner: 10Zabe) [17:46:58] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudvirt1018.eqiad.wmnet [17:47:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:52] (03CR) 10Reedy: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/776349 (https://phabricator.wikimedia.org/T257473) (owner: 10Zabe) [17:49:57] 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review, 10cloud-services-team (Kanban): decommission cloudvirt1018.eqiad.wmnet - https://phabricator.wikimedia.org/T296790 (10Andrew) a:03Cmjohnson [17:50:33] 10SRE, 10Wikimedia-Incident: Modernize etcd tlsproxy certificate management - https://phabricator.wikimedia.org/T307382 (10Dzahn) We have to differentiate between: profile::etcd::tlsproxy and profile::etcd::v3 both have a sslcert::certificate but there is the comment ` # TLS certs *for etcd use* in... [17:50:48] !log ebernhardson@deploy1002 Started deploy [search/mjolnir/deploy@f94bb01]: T306123: adjust uploaded models to always have a positive score [17:50:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:52] T306123: Ensure mjolnir models have positive scores - https://phabricator.wikimedia.org/T306123 [17:51:33] !log ebernhardson@deploy1002 Finished deploy [search/mjolnir/deploy@f94bb01]: T306123: adjust uploaded models to always have a positive score (duration: 00m 45s) [17:51:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:06] RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:54:58] 10SRE, 10Wikimedia-Incident: Modernize etcd tlsproxy certificate management - https://phabricator.wikimedia.org/T307382 (10colewhite) >>! 
In T307382#7897011, @Dzahn wrote: > We have to differentiate between: > > profile::etcd::tlsproxy > > and > > profile::etcd::v3 > > both have a sslcert::certificate bu... [17:59:21] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host aqs2008.codfw.wmnet with OS bullseye [17:59:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:26] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host aqs2008.codfw.wmnet with OS bullseye executed with errors: - aqs2008 (... [18:01:23] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host aqs2009.codfw.wmnet with OS bullseye [18:01:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:27] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host aqs2009.codfw.wmnet with OS bullseye executed with errors: - aqs2009 (... [18:06:23] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q4:(Need By: TBD) rack/setup/install pki2002 - https://phabricator.wikimedia.org/T305489 (10Papaul) [18:06:29] 10SRE, 10ops-codfw, 10DC-Ops: mw2286 stuck after reboot - https://phabricator.wikimedia.org/T306823 (10Dzahn) Thank you! Server is still depooled though. Similarly to racking tasks this needs some agreement on the workflow. Like either the tickets should come back to us or we need to create new tickets for... 
[18:06:46] !log repooling mw2286 after hardware repair - T306823 [18:06:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:50] !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=codfw,name=mw2286.codfw.wmnet [18:06:50] T306823: mw2286 stuck after reboot - https://phabricator.wikimedia.org/T306823 [18:06:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:49] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host aqs2008.codfw.wmnet with OS bullseye [18:07:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:54] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host aqs2008.codfw.wmnet with OS bullseye [18:08:19] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host aqs2009.codfw.wmnet with OS bullseye [18:08:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:24] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host aqs2009.codfw.wmnet with OS bullseye [18:12:07] (03CR) 10Dzahn: [C: 03+1] "postgresql at wikimedia might not have an owner" [puppet] - 10https://gerrit.wikimedia.org/r/777433 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [18:12:45] (03PS5) 10Jforrester: TimedMediaHandler: Drop Beta Feature, no longer usable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612350 (https://phabricator.wikimedia.org/T248418) [18:12:45] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs2008.codfw.wmnet with reason: host reimage [18:12:47] (03PS5) 10Jforrester: TimedMediaHandler: Don't read 
wmgTmhWebPlayer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612351 (https://phabricator.wikimedia.org/T248418) [18:12:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:49] (03PS5) 10Jforrester: TimedMediaHandler: Drop pre-switch config, no longer read [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612352 (https://phabricator.wikimedia.org/T248418) [18:12:51] (03PS1) 10Jforrester: TimedMediaHandler: Disabled the BetaFeature from wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788385 (https://phabricator.wikimedia.org/T248418) [18:13:16] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs2009.codfw.wmnet with reason: host reimage [18:13:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:06] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q4:(Need By: TBD) rack/setup/install krb2002 - https://phabricator.wikimedia.org/T305488 (10Papaul) [18:14:10] 10SRE, 10Wikimedia-Incident: Modernize etcd tlsproxy certificate management - https://phabricator.wikimedia.org/T307382 (10Dzahn) Things done in reaction to the page on the weekend: add new certificate for etcd-v3.eqiad.wmnet - https://gerrit.wikimedia.org/r/c/operations/puppet/+/787884 hiera: tlsproxy: use n... 
[18:16:17] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs2008.codfw.wmnet with reason: host reimage [18:16:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:12] (03CR) 10Dzahn: [C: 03+2] Periodically run purgeExpiredBlocks.php on small wikis [puppet] - 10https://gerrit.wikimedia.org/r/776349 (https://phabricator.wikimedia.org/T257473) (owner: 10Zabe) [18:19:12] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs2009.codfw.wmnet with reason: host reimage [18:19:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:27] (03CR) 10Dzahn: [C: 03+2] "job has been created on mwmaint hosts" [puppet] - 10https://gerrit.wikimedia.org/r/776349 (https://phabricator.wikimedia.org/T257473) (owner: 10Zabe) [18:24:31] !log [mwmaint1002:~] $ sudo systemctl start mediawiki_job_purge_expired_blocks [18:24:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:47] !log [mwmaint1002:~] $ sudo systemctl start mediawiki_job_purge_expired_blocks - starting new timer for T257473 [18:24:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:51] T257473: Periodically run purgeExpiredBlocks.php maintenance script - https://phabricator.wikimedia.org/T257473 [18:27:56] (03PS1) 10Ebernhardson: cirrus: Update MLR models to 20220421 deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788386 (https://phabricator.wikimedia.org/T306123) [18:28:05] 10SRE, 10WMF-General-or-Unknown, 10Patch-For-Review, 10Wikimedia-maintenance-script-run: Periodically run purgeExpiredBlocks.php maintenance script - https://phabricator.wikimedia.org/T257473 (10Dzahn) ` [mwmaint1002:~] $ sudo systemctl status mediawiki_job_purge_expired_blocks ● mediawiki_job_purge_expire... 
[18:28:39] 10SRE, 10WMF-General-or-Unknown, 10Patch-For-Review, 10Wikimedia-maintenance-script-run: Periodically run purgeExpiredBlocks.php maintenance script - https://phabricator.wikimedia.org/T257473 (10Dzahn) 05Open→03Resolved a:03Dzahn ` [mwmaint1002:~] $ sudo systemctl status mediawiki_job_purge_expired_b... [18:28:46] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs2008.codfw.wmnet with OS bullseye [18:28:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:50] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host aqs2008.codfw.wmnet with OS bullseye completed: - aqs2008 (**PASS**)... [18:29:10] RECOVERY - Check systemd state on cloudelastic1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:31:31] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs2009.codfw.wmnet with OS bullseye [18:31:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:35] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host aqs2009.codfw.wmnet with OS bullseye completed: - aqs2009 (**PASS**)... 
[18:34:32] (03PS1) 10Herron: watchrat: match jobs 'blackbox/watchrat.*' [alerts] - 10https://gerrit.wikimedia.org/r/788387 (https://phabricator.wikimedia.org/T303803) [18:35:23] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host aqs2010.codfw.wmnet with OS bullseye [18:35:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:36] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host aqs2010.codfw.wmnet with OS bullseye [18:37:07] (03CR) 10Herron: [C: 03+2] watchrat: match jobs 'blackbox/watchrat.*' [alerts] - 10https://gerrit.wikimedia.org/r/788387 (https://phabricator.wikimedia.org/T303803) (owner: 10Herron) [18:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:39:06] (03Merged) 10jenkins-bot: watchrat: match jobs 'blackbox/watchrat.*' [alerts] - 10https://gerrit.wikimedia.org/r/788387 (https://phabricator.wikimedia.org/T303803) (owner: 10Herron) [18:41:45] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host aqs2011.codfw.wmnet with OS bullseye [18:41:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:58] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host aqs2011.codfw.wmnet with OS bullseye [18:43:31] (03CR) 10Jdlrobson: "Since Vector 2022 is an optin skin on Chinese Wikipedia, this is a new feature I don't think this needs to be backported. 
It can roll out " [skins/Vector] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/788329 (https://phabricator.wikimedia.org/T307298) (owner: 10Winston Sung) [18:44:37] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [18:44:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:54] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:46:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:00] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:50:16] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:51:41] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [18:51:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:25] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:59:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:06:17] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host aqs2010.codfw.wmnet with OS bullseye [19:06:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:26] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host aqs2010.codfw.wmnet with OS bullseye executed with errors: - aqs2010 (... 
[19:12:51] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host aqs2011.codfw.wmnet with OS bullseye [19:12:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:56] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host aqs2011.codfw.wmnet with OS bullseye executed with errors: - aqs2011 (... [19:14:10] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host aqs2010.codfw.wmnet with OS bullseye [19:14:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:17] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host aqs2010.codfw.wmnet with OS bullseye [19:14:26] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host logstash1010.eqiad.wmnet [19:14:27] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host aqs2011.codfw.wmnet with OS bullseye [19:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:32] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host aqs2011.codfw.wmnet with OS bullseye [19:14:38] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host aqs2011.codfw.wmnet with OS bullseye [19:14:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:44] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need 
By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host aqs2011.codfw.wmnet with OS bullseye executed with errors: - aqs2011 (... [19:15:54] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host aqs2011.codfw.wmnet with OS bullseye [19:15:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:03] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host aqs2011.codfw.wmnet with OS bullseye [19:16:06] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host aqs2011.codfw.wmnet with OS bullseye [19:16:07] marostegui: would you kindly set me as the person on clinic duty? [19:16:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:11] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host aqs2011.codfw.wmnet with OS bullseye executed with errors: - aqs2011 (... 
[19:18:09] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host logstash2001.codfw.wmnet [19:18:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:06] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs2010.codfw.wmnet with reason: host reimage [19:19:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:32] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host logstash1010.eqiad.wmnet [19:19:33] (03CR) 10Andrew Bogott: [C: 03+2] P:toolforge::prometheus: add toolsbeta support [puppet] - 10https://gerrit.wikimedia.org/r/788305 (https://phabricator.wikimedia.org/T304716) (owner: 10Majavah) [19:19:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:45] (03CR) 10Andrew Bogott: [C: 03+2] P:toolforge::prometheus: remove absented resources [puppet] - 10https://gerrit.wikimedia.org/r/788304 (owner: 10Majavah) [19:22:05] (03PS3) 10Stang: zhwiki: Update zh-hans version tagline and wordmark files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787754 (https://phabricator.wikimedia.org/T276694) [19:22:41] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host logstash1011.eqiad.wmnet [19:22:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:16] (03Abandoned) 10Andrew Bogott: Disable EventLogging extension [wikitech-static] - 10https://gerrit.wikimedia.org/r/788370 (owner: 10Andrew Bogott) [19:23:53] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host logstash2001.codfw.wmnet [19:23:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:24] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs2010.codfw.wmnet with reason: host reimage [19:24:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:42] 
!log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host logstash2002.codfw.wmnet [19:24:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:03] (03PS2) 10Andrew Bogott: maintain-views: remove user_options column from user table [puppet] - 10https://gerrit.wikimedia.org/r/782017 (owner: 10Zabe) [19:26:21] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:26:52] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host logstash1011.eqiad.wmnet [19:26:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:34] (03CR) 10Andrew Bogott: [C: 03+2] maintain-views: remove user_options column from user table [puppet] - 10https://gerrit.wikimedia.org/r/782017 (owner: 10Zabe) [19:28:29] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host logstash1012.eqiad.wmnet [19:28:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:42] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host logstash2002.codfw.wmnet [19:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:45] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host aqs2011.codfw.wmnet with OS bullseye [19:29:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:51] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host aqs2011.codfw.wmnet with OS bullseye [19:30:44] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host logstash2003.codfw.wmnet [19:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[19:31:47] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host pki2002.mgmt.codfw.wmnet with reboot policy FORCED [19:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:04] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host logstash1012.eqiad.wmnet [19:33:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:36] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic: cp5002 memory errors on DIMM A4 - https://phabricator.wikimedia.org/T305423 (10RobH) Jin will be onsite on May 4th @ 9AM Singapore Time to swap this memory out Order Number - 1-216864938761 [19:34:41] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs2011.codfw.wmnet with reason: host reimage [19:34:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:53] (03CR) 10Andrew Bogott: [C: 03+2] graphite: migrate archiver crons to systemd timer jobs [puppet] - 10https://gerrit.wikimedia.org/r/751207 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [19:36:46] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs2010.codfw.wmnet with OS bullseye [19:36:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:51] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host aqs2010.codfw.wmnet with OS bullseye completed: - aqs2010 (**PASS**)... 
[19:37:08] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host logstash2003.codfw.wmnet [19:37:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:08] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs2011.codfw.wmnet with reason: host reimage [19:38:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:20] hashar: any chance you could update the topic to set me as being on clinic duty? [19:44:02] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host aqs2012.codfw.wmnet with OS bullseye [19:44:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:08] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host aqs2012.codfw.wmnet with OS bullseye [19:47:16] (03CR) 10Andrew Bogott: "This all looks fine to me. I'd like to see a bit of explanation and example returns for the three new run_one variants. And if (as balloon" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/788351 (owner: 10David Caro) [19:49:05] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host logstash1023.eqiad.wmnet [19:49:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:11] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host logstash2023.codfw.wmnet [19:49:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:23] (03PS1) 10Razzi: ssl: add superset-next.wikimedia.org to yarn.wikimedia.org cert [puppet] - 10https://gerrit.wikimedia.org/r/788396 (https://phabricator.wikimedia.org/T275575) [19:50:18] (03CR) 10Razzi: "OK the cert generation went fine, now this updates the .crt in puppet." 
[puppet] - 10https://gerrit.wikimedia.org/r/788396 (https://phabricator.wikimedia.org/T275575) (owner: 10Razzi) [19:50:51] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs2011.codfw.wmnet with OS bullseye [19:50:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:56] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host aqs2011.codfw.wmnet with OS bullseye completed: - aqs2011 (**PASS**)... [19:52:18] (ProbeDown) firing: Service kibana7:443 has failed probes (http_kibana7_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:53:33] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10Papaul) [19:53:40] (03CR) 10Razzi: [C: 03+2] ssl: add superset-next.wikimedia.org to yarn.wikimedia.org cert [puppet] - 10https://gerrit.wikimedia.org/r/788396 (https://phabricator.wikimedia.org/T275575) (owner: 10Razzi) [19:55:33] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host pki2002.mgmt.codfw.wmnet with reboot policy FORCED [19:55:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:56:23] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host krb2002.mgmt.codfw.wmnet with reboot 
policy FORCED [19:56:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:18] (ProbeDown) resolved: Service kibana7:443 has failed probes (http_kibana7_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:57:32] (03CR) 10Ottomata: [C: 03+1] Stream configs for newly migrated android schemas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/783874 (https://phabricator.wikimedia.org/T306385) (owner: 10Sharvaniharan) [19:58:41] !log herron@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host logstash2023.codfw.wmnet [19:58:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:50] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host logstash1023.eqiad.wmnet [19:58:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:28] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host logstash1024.eqiad.wmnet [19:59:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:35] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host logstash2024.codfw.wmnet [19:59:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:04] RoanKattouw, Urbanecm, and cjming: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220502T2000). [20:00:04] tgr, koi, ebernhardson, jdlrobson, and jdrewniak: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:16] present [20:00:20] hi [20:00:21] equally [20:00:39] o/ [20:00:51] Hello everyone, I can deploy [20:01:43] (03CR) 10Catrope: [C: 03+2] Video landing page: Show different title/body text on mobile [extensions/GrowthExperiments] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/788336 (https://phabricator.wikimedia.org/T303785) (owner: 10Gergő Tisza) [20:02:22] (03CR) 10Catrope: [C: 04-1] "This has conflict markers in it" [skins/Vector] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/788338 (https://phabricator.wikimedia.org/T307271) (owner: 10Jdlrobson) [20:02:41] Jdlrobson: Your patch has conflict markers in it, please resolve the rebase conflicts ---^^ [20:03:10] thanks RoanKattouw! The GrowthExperiments one will need a sync-world. [20:03:13] (03CR) 10Catrope: [C: 03+2] zhwiki: Update zh-hans version tagline and wordmark files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787754 (https://phabricator.wikimedia.org/T276694) (owner: 10Stang) [20:03:25] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host logstash1024.eqiad.wmnet [20:03:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:34] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host logstash2024.codfw.wmnet [20:03:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:51] (03PS2) 10Ebernhardson: cirrus: Update MLR models to 20220421 deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788386 (https://phabricator.wikimedia.org/T306123) [20:03:55] RoanKattouw: looking [20:03:59] (03Merged) 10jenkins-bot: zhwiki: Update zh-hans version tagline and wordmark files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787754 (https://phabricator.wikimedia.org/T276694) (owner: 10Stang) [20:04:04] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host logstash1025.eqiad.wmnet [20:04:06] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:08] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host logstash2025.codfw.wmnet [20:04:11] (03CR) 10jerkins-bot: [V: 04-1] [TOC] Remove pointer-events:none on .sidebar-toc-link [skins/Vector] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/788338 (https://phabricator.wikimedia.org/T307271) (owner: 10Jdlrobson) [20:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:14] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - kibana7_443: Servers logstash1025.eqiad.wmnet, logstash1024.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:04:38] koi: Your patch is on mwdebug1002, please test (not sure if wordmark stuff can be tested there though) [20:04:46] 10SRE, 10Scap: Add new user identity to Keyholder for scap - https://phabricator.wikimedia.org/T307351 (10Peachey88) [20:04:51] (03PS2) 10Jdlrobson: [TOC] Remove pointer-events:none on .sidebar-toc-link [skins/Vector] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/788338 (https://phabricator.wikimedia.org/T307271) [20:04:51] got it, looking [20:04:53] !log aqu@deploy1002 Started deploy [airflow-dags/analytics@9fa5d7e]: Fix app_session_metrics [airflow-dags/analytics@9fa5d7e] [20:04:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:03] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics@9fa5d7e]: Fix app_session_metrics [airflow-dags/analytics@9fa5d7e] (duration: 00m 09s) [20:05:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:52] RoanKattouw, LGTM [20:05:55] (03CR) 10Catrope: [C: 03+2] [TOC] Remove pointer-events:none on .sidebar-toc-link [skins/Vector] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/788338 (https://phabricator.wikimedia.org/T307271) (owner: 10Jdlrobson) [20:07:18] (ProbeDown) firing: Service kibana7:443 has failed 
probes (http_kibana7_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:07:33] !log catrope@deploy1002 Synchronized static/images/mobile/copyright/wikipedia-wordmark-zh-hans.svg: Config: [[gerrit:787754|zhwiki: Update zh-hans version tagline and wordmark files (T276694)]] (duration: 00m 48s) [20:07:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:38] T276694: Simplified Chinese logo of zhwiki was overrided by an old version - https://phabricator.wikimedia.org/T276694 [20:08:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:08:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:21] !log catrope@deploy1002 Synchronized static/images/mobile/copyright/wikipedia-tagline-zh-hans.svg: Config: [[gerrit:787754|zhwiki: Update zh-hans version tagline and wordmark files (T276694)]] (duration: 00m 47s) [20:08:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:09:04] RoanKattouw, hey, is it ok when I add a patch to the window? [20:09:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:09:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:09] !log catrope@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:787754|zhwiki: Update zh-hans version tagline and wordmark files (T276694)]] (duration: 00m 47s) [20:09:10] zabe: Sure! 
[20:09:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:15] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host logstash1025.eqiad.wmnet [20:09:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:38] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:09:39] (03PS29) 10Bking: Elastic: test puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) [20:10:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:10:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:18] koi: Yours should be live now [20:10:49] (03PS3) 10Catrope: cirrus: Update MLR models to 20220421 deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788386 (https://phabricator.wikimedia.org/T306123) (owner: 10Ebernhardson) [20:10:53] (03CR) 10Catrope: [C: 03+2] cirrus: Update MLR models to 20220421 deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788386 (https://phabricator.wikimedia.org/T306123) (owner: 10Ebernhardson) [20:10:58] yeah, looks great now, thanks! 
[20:11:44] (03Merged) 10jenkins-bot: cirrus: Update MLR models to 20220421 deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788386 (https://phabricator.wikimedia.org/T306123) (owner: 10Ebernhardson) [20:12:18] (ProbeDown) resolved: Service kibana7:443 has failed probes (http_kibana7_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:12:19] ebernhardson: Your patch is on mwdebug1002, please test (to the extent possible) [20:12:50] RoanKattouw: kk [20:13:25] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [20:13:37] !log herron@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host logstash2025.codfw.wmnet [20:13:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:47] RoanKattouw: all looks reasonable [20:14:23] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host krb2002.mgmt.codfw.wmnet with reboot policy FORCED [20:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:44] !log catrope@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:788386|cirrus: Update MLR models to 20220421 deployment (T306123)]] (duration: 00m 48s) [20:14:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:48] T306123: Ensure mjolnir models have positive scores - https://phabricator.wikimedia.org/T306123 [20:15:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:15:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:19] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host aqs2012.codfw.wmnet with OS 
bullseye [20:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:24] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host aqs2012.codfw.wmnet with OS bullseye executed with errors: - aqs2012 (... [20:16:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:16:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:16:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:16:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:32] (03PS2) 10Catrope: Revert "Revert "Start writing to cuc_actor in guwwiki and shnwikivoyage"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779922 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [20:18:36] (03CR) 10Catrope: [C: 03+2] Revert "Revert "Start writing to cuc_actor in guwwiki and shnwikivoyage"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779922 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [20:19:47] (03Merged) 10jenkins-bot: Revert "Revert "Start writing to cuc_actor in guwwiki and shnwikivoyage"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779922 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [20:20:22] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host aqs2012.codfw.wmnet with OS bullseye [20:20:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:28] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - 
https://phabricator.wikimedia.org/T305568 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host aqs2012.codfw.wmnet with OS bullseye [20:20:32] (03PS1) 10Kosta Harlan: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/788404 (https://phabricator.wikimedia.org/T306792) [20:20:55] (03PS30) 10Bking: Elastic: Use OS major version for GC flags [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) [20:21:08] (03CR) 10Kosta Harlan: [C: 03+2] linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/788404 (https://phabricator.wikimedia.org/T306792) (owner: 10Kosta Harlan) [20:22:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:22:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:22:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:23:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:16] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs2012.codfw.wmnet with reason: host reimage [20:25:17] zabe: Your patch is on mwdebug1002, please test [20:25:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:16] (03Merged) 10jenkins-bot: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/788404 (https://phabricator.wikimedia.org/T306792) (owner: 10Kosta Harlan) [20:26:25] RECOVERY - SSH on aqs1008.mgmt is OK: SSH 
OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:27:16] RoanKattouw, lgtm [20:27:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudcontrol100[6-7].wikimedia.org - https://phabricator.wikimedia.org/T306853 (10Andrew) I agree with @Majavah's assessment, although I wouldn't promise that there aren't other edge cases where we are rely... [20:28:04] (03Merged) 10jenkins-bot: Video landing page: Show different title/body text on mobile [extensions/GrowthExperiments] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/788336 (https://phabricator.wikimedia.org/T303785) (owner: 10Gergő Tisza) [20:28:06] (03Merged) 10jenkins-bot: [TOC] Remove pointer-events:none on .sidebar-toc-link [skins/Vector] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/788338 (https://phabricator.wikimedia.org/T307271) (owner: 10Jdlrobson) [20:28:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudcontrol100[6-7].wikimedia.org - https://phabricator.wikimedia.org/T306853 (10Andrew) @ayounsi please also be aware that our team of five SREs is currently down to three. This means we will have, if any... 
[20:28:45] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs2012.codfw.wmnet with reason: host reimage [20:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:36] !log catrope@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:779922|Revert "Revert "Start writing to cuc_actor in guwwiki and shnwikivoyage"" (T233004)]] (duration: 00m 47s) [20:30:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:40] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [20:31:50] !log kharlan@deploy1002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply [20:31:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:17] Jdlrobson and tgr: Your changes are on mwdebug1002, please test (the i18n parts of tgr's change probably won't work until I run sync-world though) [20:32:39] (03CR) 10David Caro: wmcs: Fix types and associated code refactor (033 comments) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/788351 (owner: 10David Caro) [20:32:50] !log kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply [20:32:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:24] RoanKattouw: on it [20:34:10] RoanKattouw: tested. works [20:34:18] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudweb100[34] - https://phabricator.wikimedia.org/T305414 (10Andrew) >>! In T305414#7858338, @ayounsi wrote: > Thanks, from what I understand moving those hosts to private IPs are much shorter term goals... 
[20:37:25] RoanKattouw: not what I expected, but looks good enough to sync, I'll test the i18n part after that [20:37:34] OK, will sync now [20:38:01] !log catrope@deploy1002 Started scap: Backport: [[gerrit:788338|[TOC] Remove pointer-events:none on .sidebar-toc-link (T307271)]] and [[gerrit:788336|Video landing page: Show different title/body text on mobile (T303785)]] [20:38:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:07] T303785: Account creation: social media landing pages - https://phabricator.wikimedia.org/T303785 [20:38:08] T307271: Links in the TOC of vector 2022 are not clickable for some Chromium based browsers - https://phabricator.wikimedia.org/T307271 [20:40:09] !log kharlan@deploy1002 helmfile [eqiad] START helmfile.d/services/linkrecommendation: apply [20:40:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:36] Thanks RoanKattouw [20:40:41] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs2012.codfw.wmnet with OS bullseye [20:40:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:47] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host aqs2012.codfw.wmnet with OS bullseye completed: - aqs2012 (**PASS**)... 
[20:41:33] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10Papaul) [20:42:29] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10Papaul) 05Open→03Resolved @Eevans this is complete [20:42:32] !log kharlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: apply [20:42:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:28] (03PS31) 10Ryan Kemper: Elastic: Use OS major version for GC flags [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [20:44:27] !log kharlan@deploy1002 helmfile [codfw] START helmfile.d/services/linkrecommendation: apply [20:44:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:27] (03PS32) 10Ryan Kemper: Elastic: Use OS major version for GC flags [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [20:45:38] ugh. Apparently the code had two errors which cancel each other out. With the i18n message missing on mwdebug, only one of those errors happen. 
[20:46:16] !log kharlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/linkrecommendation: apply [20:46:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:49:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:45] 2 errors that cancel each other out sounds better than it probably should [20:49:46] !log catrope@deploy1002 Finished scap: Backport: [[gerrit:788338|[TOC] Remove pointer-events:none on .sidebar-toc-link (T307271)]] and [[gerrit:788336|Video landing page: Show different title/body text on mobile (T303785)]] (duration: 11m 45s) [20:49:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:51] T303785: Account creation: social media landing pages - https://phabricator.wikimedia.org/T303785 [20:49:52] T307271: Links in the TOC of vector 2022 are not clickable for some Chromium based browsers - https://phabricator.wikimedia.org/T307271 [20:51:18] tgr, Jdlrobson: Your patches are deployed now, let me know if you need to deploy any additional fixes [20:52:46] thanks RoanKattouw! 
It's good as it is for now [20:52:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:52:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:53:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:43] PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:56:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:56:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:05] Reedy, sbassett, Maryum, and manfredi: Dear deployers, time to do the Weekly Security deployment window deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220502T2100). [21:08:43] (03CR) 10Razzi: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/666481 (https://phabricator.wikimedia.org/T275575) (owner: 10Razzi) [21:08:47] (03CR) 10jerkins-bot: [V: 04-1] Configure superset-next.wikimedia.org domain [puppet] - 10https://gerrit.wikimedia.org/r/666481 (https://phabricator.wikimedia.org/T275575) (owner: 10Razzi) [21:10:18] (03PS3) 10Razzi: Configure superset-next.wikimedia.org domain [puppet] - 10https://gerrit.wikimedia.org/r/666481 (https://phabricator.wikimedia.org/T275575) [21:11:54] (03PS3) 10Razzi: Add superset-next domain CNAME [dns] - 10https://gerrit.wikimedia.org/r/774537 (https://phabricator.wikimedia.org/T275575) [21:31:26] (03PS4) 10Razzi: Configure superset-next.wikimedia.org domain [puppet] - 10https://gerrit.wikimedia.org/r/666481 (https://phabricator.wikimedia.org/T275575) [21:32:27] (03CR) 10Razzi: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35031/console" [puppet] - 10https://gerrit.wikimedia.org/r/666481 (https://phabricator.wikimedia.org/T275575) (owner: 10Razzi) [21:40:20] 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (10RobH) [21:43:39] 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (10RobH) [21:47:50] (03PS33) 10Bking: Elastic: Use OS major version for GC flags [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) [21:48:34] (03CR) 10jerkins-bot: [V: 04-1] Elastic: Use OS major version for GC flags [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [21:51:39] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:55:18] (03PS1) 10Andrea Denisse: add Andrea Denisse Gómez-Martínez [puppet] - 
10https://gerrit.wikimedia.org/r/788431 [21:55:19] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:01:21] (03PS34) 10Bking: Elastic: Use OS major version for GC flags [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) [22:02:09] (03CR) 10jerkins-bot: [V: 04-1] Elastic: Use OS major version for GC flags [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [22:03:43] 10ops-eqiad, 10DC-Ops: Q4: rack/setup/install dse-k8s-worker100[5-8] - https://phabricator.wikimedia.org/T307400 (10RobH) [22:03:54] 10ops-eqiad, 10DC-Ops: Q4: rack/setup/install dse-k8s-worker100[5-8] - https://phabricator.wikimedia.org/T307400 (10RobH) [22:05:21] (03PS35) 10Bking: Elastic: Use OS major version for GC flags [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) [22:06:06] (03CR) 10jerkins-bot: [V: 04-1] Elastic: Use OS major version for GC flags [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [22:10:01] (03PS36) 10Bking: Elastic: Use OS major version for GC flags [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) [22:12:21] 10SRE-OnFire-Incident-Docs, 10Observability-Alerting, 10serviceops-radar, 10Sustainability (Incident Followup), 10Wikimedia-Incident: Certificate expiration monitoring - https://phabricator.wikimedia.org/T307383 (10Dzahn) [22:14:13] (03PS1) 10Papaul: Add nes pki and krb node to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/788433 (https://phabricator.wikimedia.org/T305489) [22:14:36] (03PS37) 10Bking: Elastic: Use OS major version for GC flags [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) [22:14:46] (03CR) 10jerkins-bot: [V: 04-1] Add nes 
pki and krb node to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/788433 (https://phabricator.wikimedia.org/T305489) (owner: 10Papaul) [22:15:53] (03CR) 10Dzahn: "beware: the letter "c" in first line would break site.pp" [puppet] - 10https://gerrit.wikimedia.org/r/788433 (https://phabricator.wikimedia.org/T305489) (owner: 10Papaul) [22:20:12] 10SRE, 10Wikimedia-Incident: Modernize etcd tlsproxy certificate management - https://phabricator.wikimedia.org/T307382 (10Dzahn) https://wikitech.wikimedia.org/wiki/Incidents/2022-05-01_etcd [22:22:14] (03PS2) 10Papaul: Add new pki and krb node to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/788433 (https://phabricator.wikimedia.org/T305489) [22:22:47] (03CR) 10jerkins-bot: [V: 04-1] Add new pki and krb node to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/788433 (https://phabricator.wikimedia.org/T305489) (owner: 10Papaul) [22:23:37] (03PS2) 10Andrea Denisse: add Andrea Denisse Gómez-Martínez [puppet] - 10https://gerrit.wikimedia.org/r/788431 [22:27:13] (03CR) 10JHathaway: [C: 03+2] add Andrea Denisse Gómez-Martínez [puppet] - 10https://gerrit.wikimedia.org/r/788431 (owner: 10Andrea Denisse) [22:27:47] (03PS3) 10Papaul: Add new pki and krb node to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/788433 (https://phabricator.wikimedia.org/T305489) [22:27:51] (03PS1) 10Cwhite: profile: add etcd tlsproxy certificate monitoring [puppet] - 10https://gerrit.wikimedia.org/r/788435 (https://phabricator.wikimedia.org/T307383) [22:27:56] (03PS4) 10Papaul: Add new pki and krb node to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/788433 (https://phabricator.wikimedia.org/T305489) [22:28:33] (03CR) 10jerkins-bot: [V: 04-1] profile: add etcd tlsproxy certificate monitoring [puppet] - 10https://gerrit.wikimedia.org/r/788435 (https://phabricator.wikimedia.org/T307383) (owner: 10Cwhite) [22:29:34] (03PS5) 10Juan90264: Fix: Enable 
'$wgCopyUploadsDomains' to viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/785208 (https://phabricator.wikimedia.org/T303577) [22:29:47] (03PS6) 10Juan90264: Fix: Enable '$wgCopyUploadsDomains' to viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/785208 (https://phabricator.wikimedia.org/T303577) [22:30:12] (03CR) 10Papaul: [C: 03+2] Add new pki and krb node to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/788433 (https://phabricator.wikimedia.org/T305489) (owner: 10Papaul) [22:30:31] (03PS2) 10Cwhite: profile: add etcd tlsproxy certificate monitoring [puppet] - 10https://gerrit.wikimedia.org/r/788435 (https://phabricator.wikimedia.org/T307383) [22:32:12] (03PS7) 10Juan90264: Fix: Enable '$wgCopyUploadsDomains' to viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/785208 (https://phabricator.wikimedia.org/T303577) [22:32:24] (03CR) 10jerkins-bot: [V: 04-1] profile: add etcd tlsproxy certificate monitoring [puppet] - 10https://gerrit.wikimedia.org/r/788435 (https://phabricator.wikimedia.org/T307383) (owner: 10Cwhite) [22:34:52] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host pki2002.codfw.wmnet with OS bullseye [22:34:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:34:56] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q4:(Need By: TBD) rack/setup/install pki2002 - https://phabricator.wikimedia.org/T305489 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host pki2002.codfw.wmnet with OS bullseye [22:36:28] (03PS1) 10Cwhite: add snakoil etcd-v3.eqiad.wmnet.key [labs/private] - 10https://gerrit.wikimedia.org/r/788436 [22:36:43] (03CR) 10Dzahn: "I think it will fail with duplicate definitions if the check name and description are the same for multiple hosts. 
We gotta use the hostna" [puppet] - 10https://gerrit.wikimedia.org/r/788435 (https://phabricator.wikimedia.org/T307383) (owner: 10Cwhite) [22:37:12] (03PS3) 10Cwhite: profile: add etcd tlsproxy certificate monitoring [puppet] - 10https://gerrit.wikimedia.org/r/788435 (https://phabricator.wikimedia.org/T307383) [22:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:38:15] (03CR) 10Cwhite: [V: 03+2 C: 03+2] add snakoil etcd-v3.eqiad.wmnet.key [labs/private] - 10https://gerrit.wikimedia.org/r/788436 (owner: 10Cwhite) [22:39:09] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q4:(Need By: TBD) rack/setup/install pki2002 - https://phabricator.wikimedia.org/T305489 (10Papaul) [22:39:14] (03PS1) 10Dzahn: etcd::tlsproxy: set use_cergen to true [puppet] - 10https://gerrit.wikimedia.org/r/788437 (https://phabricator.wikimedia.org/T307382) [22:39:26] (03PS5) 10Razzi: Configure superset-next.wikimedia.org domain [puppet] - 10https://gerrit.wikimedia.org/r/666481 (https://phabricator.wikimedia.org/T275575) [22:39:43] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q4:(Need By: TBD) rack/setup/install krb2002 - https://phabricator.wikimedia.org/T305488 (10Papaul) [22:40:58] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host krb2002.codfw.wmnet with OS bullseye [22:41:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:41:03] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q4:(Need By: TBD) rack/setup/install krb2002 - https://phabricator.wikimedia.org/T305488 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host krb2002.codfw.wmnet with OS bullseye [22:41:07] (03CR) 10Cwhite: "PCC checks out: 
https://puppet-compiler.wmflabs.org/pcc-worker1001/35033/" [puppet] - 10https://gerrit.wikimedia.org/r/788435 (https://phabricator.wikimedia.org/T307383) (owner: 10Cwhite) [22:41:15] (03CR) 10jerkins-bot: [V: 04-1] etcd::tlsproxy: set use_cergen to true [puppet] - 10https://gerrit.wikimedia.org/r/788437 (https://phabricator.wikimedia.org/T307382) (owner: 10Dzahn) [22:42:44] (03CR) 10Dzahn: ""invalid secret certificates/etcd.eqiad.wmnet/etcd.eqiad.wmnet.key.private.pem" is missing when switching to use_cergen" [labs/private] - 10https://gerrit.wikimedia.org/r/788436 (owner: 10Cwhite) [22:43:43] (03PS2) 10Krinkle: clinic-duty: stop using 'document' to make tests pass [software] - 10https://gerrit.wikimedia.org/r/788297 (owner: 10Filippo Giunchedi) [22:43:45] (03PS2) 10Krinkle: clinic-duty: add Orange support [software] - 10https://gerrit.wikimedia.org/r/788296 (owner: 10Filippo Giunchedi) [22:43:49] (03CR) 10Krinkle: [C: 03+1] clinic-duty: stop using 'document' to make tests pass [software] - 10https://gerrit.wikimedia.org/r/788297 (owner: 10Filippo Giunchedi) [22:46:15] 10SRE, 10Patch-For-Review, 10Wikimedia-Incident: Modernize etcd tlsproxy certificate management - https://phabricator.wikimedia.org/T307382 (10Dzahn) https://gerrit.wikimedia.org/r/c/labs/private/+/788436/ [22:46:23] (03PS1) 10Dzahn: add fake certificates for etcd-v3.eqiad and etcd-v3.codfw [labs/private] - 10https://gerrit.wikimedia.org/r/788439 (https://phabricator.wikimedia.org/T307382) [22:48:00] (03PS2) 10Dzahn: add fake certificates for etcd-v3.eqiad and etcd-v3.codfw [labs/private] - 10https://gerrit.wikimedia.org/r/788439 (https://phabricator.wikimedia.org/T307382) [22:49:37] (03CR) 10Cwhite: "Probably also needs etcd-v3.(eqiad|codfw).wmnet.key.private.pem if wanting to use PCC to test use_cergen => true?" 
[labs/private] - 10https://gerrit.wikimedia.org/r/788439 (https://phabricator.wikimedia.org/T307382) (owner: 10Dzahn) [22:49:39] (03PS1) 10Andrea Denisse: Use diff --color instead of colordiff as colordiff is not standard [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/788440 [22:52:45] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:52:57] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on pki2002.codfw.wmnet with reason: host reimage [22:52:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:54:25] (03CR) 10Krinkle: clinic-duty: add Orange support (031 comment) [software] - 10https://gerrit.wikimedia.org/r/788296 (owner: 10Filippo Giunchedi) [22:56:27] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pki2002.codfw.wmnet with reason: host reimage [22:56:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:58:07] (03CR) 10Dzahn: [C: 03+1] "confirmed. on a standard bullseye install I do not have colordiff installed by default. 
I do have diff though and diff --color output look" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/788440 (owner: 10Andrea Denisse) [22:59:22] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on krb2002.codfw.wmnet with reason: host reimage [22:59:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [23:02:14] (03PS2) 10Andrea Denisse: Use diff --color instead of colordiff as colordiff is not standard [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/788440 [23:02:48] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on krb2002.codfw.wmnet with reason: host reimage [23:02:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:02:58] (03CR) 10Cwhite: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/777888 (https://phabricator.wikimedia.org/T305088) (owner: 10Cwhite) [23:03:39] (03CR) 10Cwhite: [C: 03+2] logstash: rewrite ecs settings [puppet] - 10https://gerrit.wikimedia.org/r/777887 (https://phabricator.wikimedia.org/T305013) (owner: 10Cwhite) [23:03:46] RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:06:40] (03CR) 10Cwhite: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/777891 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite) [23:08:45] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pki2002.codfw.wmnet with OS bullseye [23:08:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:08:49] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q4:(Need By: TBD) rack/setup/install pki2002 - https://phabricator.wikimedia.org/T305489 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host pki2002.codfw.wmnet with OS bullseye completed: - pki2002 (**... [23:09:41] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q4:(Need By: TBD) rack/setup/install pki2002 - https://phabricator.wikimedia.org/T305489 (10Papaul) [23:10:18] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q4:(Need By: TBD) rack/setup/install pki2002 - https://phabricator.wikimedia.org/T305489 (10Papaul) 05Open→03Resolved @jbond this is complete [23:15:17] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host krb2002.codfw.wmnet with OS bullseye [23:15:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:15:26] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q4:(Need By: TBD) rack/setup/install krb2002 - https://phabricator.wikimedia.org/T305488 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host krb2002.codfw.wmnet with OS bullseye completed: - krb2002 (**... 
[23:17:26] (03PS3) 10Dzahn: add fake certificates and keys for etcd-v3.eqiad and etcd-v3.codfw [labs/private] - 10https://gerrit.wikimedia.org/r/788439 (https://phabricator.wikimedia.org/T307382) [23:19:00] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q4:(Need By: TBD) rack/setup/install krb2002 - https://phabricator.wikimedia.org/T305488 (10Papaul) [23:19:08] (03CR) 10Cwhite: [C: 03+1] add fake certificates and keys for etcd-v3.eqiad and etcd-v3.codfw [labs/private] - 10https://gerrit.wikimedia.org/r/788439 (https://phabricator.wikimedia.org/T307382) (owner: 10Dzahn) [23:19:44] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q4:(Need By: TBD) rack/setup/install krb2002 - https://phabricator.wikimedia.org/T305488 (10Papaul) 05Open→03Resolved @MoritzMuehlenhoff this is complete [23:19:58] (03CR) 10Dzahn: "it's not exactly like the other services in this directory. either they have fake versions of ALL the files in private or only a fake key " [labs/private] - 10https://gerrit.wikimedia.org/r/788439 (https://phabricator.wikimedia.org/T307382) (owner: 10Dzahn) [23:29:21] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:45:22] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10Papaul) [23:55:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale