[00:00:26] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T298294)', diff saved to https://phabricator.wikimedia.org/P22161 and previous config saved to /var/cache/conftool/dbconfig/20220309-000025-marostegui.json
[00:00:28] !log marostegui@cumin2002 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[00:00:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:00:30] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294
[00:00:31] !log marostegui@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[00:00:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:00:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:02:13] !log marostegui@cumin2002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1182.eqiad.wmnet with reason: Maintenance
[00:02:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:02:16] !log marostegui@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1182.eqiad.wmnet with reason: Maintenance
[00:02:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:02:51] !log marostegui@cumin2002 dbctl commit (dc=all): 'Depooling db1182 (T298294)', diff saved to https://phabricator.wikimedia.org/P22162 and previous config saved to /var/cache/conftool/dbconfig/20220309-000250-marostegui.json
[00:02:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:06:01] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T298294)', diff saved to https://phabricator.wikimedia.org/P22163 and previous config saved to /var/cache/conftool/dbconfig/20220309-000600-marostegui.json
[00:06:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:06:04] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294
[00:09:14] PROBLEM - Check systemd state on grafana1002 is CRITICAL: CRITICAL - degraded: The following units failed: grafana-ldap-users-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:14:46] RECOVERY - Check systemd state on grafana1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:21:36] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P22164 and previous config saved to /var/cache/conftool/dbconfig/20220309-002135-marostegui.json
[00:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:37:11] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P22165 and previous config saved to /var/cache/conftool/dbconfig/20220309-003710-marostegui.json
[00:37:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:52:46] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T298294)', diff saved to https://phabricator.wikimedia.org/P22166 and previous config saved to /var/cache/conftool/dbconfig/20220309-005245-marostegui.json
[00:52:48] !log marostegui@cumin2002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1105.eqiad.wmnet with reason: Maintenance
[00:52:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:52:51] !log marostegui@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1105.eqiad.wmnet with reason: Maintenance
[00:52:51] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294
[00:52:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:52:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:53:26] !log marostegui@cumin2002 dbctl commit (dc=all): 'Depooling db1105:3312 (T298294)', diff saved to https://phabricator.wikimedia.org/P22167 and previous config saved to /var/cache/conftool/dbconfig/20220309-005325-marostegui.json
[00:53:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:01:47] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T298294)', diff saved to https://phabricator.wikimedia.org/P22168 and previous config saved to /var/cache/conftool/dbconfig/20220309-010146-marostegui.json
[01:01:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:01:50] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294
[01:14:50] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (6) node(s) change every puppet run: cloudcontrol1003, cloudcontrol1004, cloudcontrol1005, datahubsearch1001, datahubsearch1002, datahubsearch1003 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[01:17:22] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P22169 and previous config saved to /var/cache/conftool/dbconfig/20220309-011721-marostegui.json
[01:17:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:32:57] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P22170 and previous config saved to /var/cache/conftool/dbconfig/20220309-013256-marostegui.json
[01:32:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:40:30] (JobUnavailable) firing: (4) Reduced availability for job gitlab in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[01:48:32] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T298294)', diff saved to https://phabricator.wikimedia.org/P22171 and previous config saved to /var/cache/conftool/dbconfig/20220309-014831-marostegui.json
[01:48:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:48:36] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294
[02:08:56] PROBLEM - SSH on wtp1026.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:20:48] (PS1) C. Scott Ananian: Ensure that the recognizedTagData static cache is properly initialized [core] (wmf/1.38.0-wmf.25) - https://gerrit.wikimedia.org/r/768804 (https://phabricator.wikimedia.org/T303360)
[02:25:40] just tried logging in to wikitech, got `[a434adad-c70b-4126-8ddc-b1134324a762] 2022-03-09 02:25:19: Fatal exception of type "MWException"`
[02:26:33] CAS update failed on user_touched. The version of the user to be saved is older than the current version.
[02:26:45] has there been a recent issue with these?
[03:02:06] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wik
[03:04:48] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[03:10:36] RECOVERY - SSH on wtp1026.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:22:08] PROBLEM - SSH on analytics1063.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:30:52] RECOVERY - Check systemd state on puppetmaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:39:22] RECOVERY - Check unit status of geoip_update_legacy on puppetmaster1001 is OK: OK: Status of the systemd unit geoip_update_legacy https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[04:13:18] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (7) node(s) change every puppet run: cloudcontrol1003, cloudcontrol1004, cloudcontrol1005, cumin2002, datahubsearch1001, datahubsearch1002, datahubsearch1003 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[04:34:58] PROBLEM - SSH on kubernetes2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:34:15] PROBLEM - Check systemd state on doc1001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc1002.eqiad.wmnet.service,rsync-doc-doc2001.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:40:56] (JobUnavailable) firing: (3) Reduced availability for job gitlab in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[06:02:39] PROBLEM - Host ms-fe1012 is DOWN: PING CRITICAL - Packet loss = 100%
[06:06:03] !log marostegui@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[06:06:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:06:05] !log marostegui@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[06:06:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:19:33] !log marostegui@cumin2002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[06:19:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:19:36] !log marostegui@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[06:19:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:20:06] (PS4) Marostegui: mariadb: Promote db1107 to m5 master [puppet] - https://gerrit.wikimedia.org/r/768954 (https://phabricator.wikimedia.org/T302190)
[06:20:11] !log marostegui@cumin2002 dbctl commit (dc=all): 'Depooling db1146:3312 (T298294)', diff saved to https://phabricator.wikimedia.org/P22172 and previous config saved to /var/cache/conftool/dbconfig/20220309-062010-marostegui.json
[06:20:13] RECOVERY - Check systemd state on doc1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:20:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:20:14] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294
[06:36:20] RECOVERY - SSH on kubernetes2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:38:41] (PS1) Marostegui: db1123: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/769278 (https://phabricator.wikimedia.org/T300600)
[06:40:27] (CR) Marostegui: [C: +2] db1123: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/769278 (https://phabricator.wikimedia.org/T300600) (owner: Marostegui)
[06:43:35] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1123.eqiad.wmnet with OS bullseye
[06:43:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:54] RECOVERY - SSH on dumpsdata1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:52:08] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1123.eqiad.wmnet with reason: host reimage
[06:52:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:49] !log marostegui@cumin2002 dbctl commit (dc=all): 'db1146:3312 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P22173 and previous config saved to /var/cache/conftool/dbconfig/20220309-065447-root.json
[06:54:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:55:33] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1123.eqiad.wmnet with reason: host reimage
[06:55:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:51] (PS1) Marostegui: Revert "db1123: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/769286
[07:09:35] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1123.eqiad.wmnet with OS bullseye
[07:09:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:10:15] !log marostegui@cumin2002 dbctl commit (dc=all): 'db1146:3312 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P22174 and previous config saved to /var/cache/conftool/dbconfig/20220309-071014-root.json
[07:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:11:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 5%: After schema change', diff saved to https://phabricator.wikimedia.org/P22175 and previous config saved to /var/cache/conftool/dbconfig/20220309-071153-root.json
[07:11:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:14:34] (CR) Marostegui: [C: +2] Revert "db1123: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/769286 (owner: Marostegui)
[07:17:49] (PS1) Elukey: Add role::insetup for ms-be1012 [puppet] - https://gerrit.wikimedia.org/r/769382 (https://phabricator.wikimedia.org/T294137)
[07:19:18] (PS2) Elukey: Add role::insetup for ms-fe1012 [puppet] - https://gerrit.wikimedia.org/r/769382 (https://phabricator.wikimedia.org/T294137)
[07:20:13] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:21:57] (PS1) Marostegui: phabricator.my.cnf: Remove innodb_buffer_pool_instances flag [puppet] - https://gerrit.wikimedia.org/r/769383 (https://phabricator.wikimedia.org/T301879)
[07:22:22] (CR) Wiphawrrnb63: [C: +1] Enable banners on all namespaces on Russian Wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/243728 (https://phabricator.wikimedia.org/T114566) (owner: Jdlrobson)
[07:22:39] (CR) Marostegui: [C: +2] phabricator.my.cnf: Remove innodb_buffer_pool_instances flag [puppet] - https://gerrit.wikimedia.org/r/769383 (https://phabricator.wikimedia.org/T301879) (owner: Marostegui)
[07:25:41] !log marostegui@cumin2002 dbctl commit (dc=all): 'db1146:3312 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P22176 and previous config saved to /var/cache/conftool/dbconfig/20220309-072540-root.json
[07:25:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:26:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 15%: After schema change', diff saved to https://phabricator.wikimedia.org/P22177 and previous config saved to /var/cache/conftool/dbconfig/20220309-072656-root.json
[07:26:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:19] (PS1) Marostegui: change_page_id_T300380.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/769384 (https://phabricator.wikimedia.org/T300380)
[07:30:44] (CR) Giuseppe Lavagetto: varnish/frontend: consume etcd data for dynamic banning of requests. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/763557 (owner: Giuseppe Lavagetto)
[07:30:50] (PS10) Giuseppe Lavagetto: varnish/frontend: consume etcd data for dynamic banning of requests. [puppet] - https://gerrit.wikimedia.org/r/763557
[07:31:01] (PS2) Marostegui: change_pr_page_T300380.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/769384 (https://phabricator.wikimedia.org/T300380)
[07:31:03] !log manually sync pcc facts following https://wikitech.wikimedia.org/wiki/Help:Puppet-compiler#Manually_update_production
[07:31:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:39] !log dbmaint on db1123 s3@eqiad T300600
[07:33:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:43] T300600: Upgrade s3 to Bullseye - https://phabricator.wikimedia.org/T300600
[07:34:50] !log dbmaint on s7@eqiad T300775
[07:34:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:53] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775
[07:41:08] !log marostegui@cumin2002 dbctl commit (dc=all): 'db1146:3312 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P22178 and previous config saved to /var/cache/conftool/dbconfig/20220309-074107-root.json
[07:41:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:57] !log dbmaint on s1 T300380
[07:42:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:00] T300380: Make page_restrictions.pr_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300380
[07:42:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P22179 and previous config saved to /var/cache/conftool/dbconfig/20220309-074200-root.json
[07:42:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:07] !log dbmaint on s6 T300380
[07:42:09] !log dbmaint on s5 T300380
[07:42:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:26] !log dbmaint on s5@eqiad T300380
[07:42:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:29] !log dbmaint on s6@eqiad T300380
[07:42:30] !log dbmaint on s1@eqiad T300380
[07:42:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:23] (PS1) KartikMistry: Enable SectionTranslation on Javanese, Tagalog, Mongolian, Telugu WPs [mediawiki-config] - https://gerrit.wikimedia.org/r/769386 (https://phabricator.wikimedia.org/T298237)
[07:49:52] !log dbmaint on s4@eqiad T300380
[07:49:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:55] !log dbmaint on s8@eqiad T300380
[07:49:55] T300380: Make page_restrictions.pr_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300380
[07:49:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:27] !log dbmaint on s2@eqiad T300380
[07:55:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:31] T300380: Make page_restrictions.pr_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300380
[07:57:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 40%: After schema change', diff saved to https://phabricator.wikimedia.org/P22180 and previous config saved to /var/cache/conftool/dbconfig/20220309-075704-root.json
[07:57:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:51] (CR) Elukey: "https://puppet-compiler.wmflabs.org/pcc-worker1001/34151/" [puppet] - https://gerrit.wikimedia.org/r/769382 (https://phabricator.wikimedia.org/T294137) (owner: Elukey)
[08:00:04] Amir1, awight, Urbanecm, and taavi: Your horoscope predicts another unfortunate UTC morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220309T0800).
[08:00:04] awight: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:02:30] !log marostegui@cumin2002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1123.eqiad.wmnet with reason: Maintenance
[08:02:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:02:33] !log marostegui@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1123.eqiad.wmnet with reason: Maintenance
[08:02:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:08] !log marostegui@cumin2002 dbctl commit (dc=all): 'Depooling db1123 (T298294)', diff saved to https://phabricator.wikimedia.org/P22181 and previous config saved to /var/cache/conftool/dbconfig/20220309-080307-marostegui.json
[08:03:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:11] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294
[08:03:16] PROBLEM - SSH on thumbor2004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:06:37] (PS11) Giuseppe Lavagetto: varnish/frontend: consume etcd data for dynamic banning of requests. [puppet] - https://gerrit.wikimedia.org/r/763557
[08:06:39] (PS1) Giuseppe Lavagetto: varnish: enable dynamic bans on one host per cluster in eqsin [puppet] - https://gerrit.wikimedia.org/r/769388 (https://phabricator.wikimedia.org/T302471)
[08:06:41] (PS1) Giuseppe Lavagetto: cache: turn on dynamic bans on all of eqsin [puppet] - https://gerrit.wikimedia.org/r/769389 (https://phabricator.wikimedia.org/T302471)
[08:06:43] (PS1) Giuseppe Lavagetto: cache: enable dynamic bans everywhere [puppet] - https://gerrit.wikimedia.org/r/769390 (https://phabricator.wikimedia.org/T302471)
[08:11:52] !log dbmaint on s7@eqiad T300380
[08:11:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:56] T300380: Make page_restrictions.pr_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300380
[08:13:14] (CR) Jelto: [V: +1 C: +2] gitlab_runner: add dedicated service unit file [puppet] - https://gerrit.wikimedia.org/r/769065 (https://phabricator.wikimedia.org/T295481) (owner: Jelto)
[08:17:50] (PS12) Giuseppe Lavagetto: varnish/frontend: consume etcd data for dynamic banning of requests. [puppet] - https://gerrit.wikimedia.org/r/763557 (https://phabricator.wikimedia.org/T302471)
[08:17:52] (PS2) Giuseppe Lavagetto: varnish: enable dynamic bans on one host per cluster in eqsin [puppet] - https://gerrit.wikimedia.org/r/769388 (https://phabricator.wikimedia.org/T302471)
[08:17:54] (PS2) Giuseppe Lavagetto: cache: turn on dynamic bans on all of eqsin [puppet] - https://gerrit.wikimedia.org/r/769389 (https://phabricator.wikimedia.org/T302471)
[08:17:56] (PS2) Giuseppe Lavagetto: cache: enable dynamic bans everywhere [puppet] - https://gerrit.wikimedia.org/r/769390 (https://phabricator.wikimedia.org/T302471)
[08:20:52] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1123 (T298294)', diff saved to https://phabricator.wikimedia.org/P22182 and previous config saved to /var/cache/conftool/dbconfig/20220309-082051-marostegui.json
[08:20:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:55] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294
[08:21:02] !log dbmaint on s3@eqiad T300380
[08:21:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:05] T300380: Make page_restrictions.pr_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300380
[08:25:30] (JobUnavailable) firing: (3) Reduced availability for job gitlab in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[08:33:47] (PS13) Giuseppe Lavagetto: varnish/frontend: consume etcd data for dynamic banning of requests. [puppet] - https://gerrit.wikimedia.org/r/763557 (https://phabricator.wikimedia.org/T302471)
[08:33:49] (PS3) Giuseppe Lavagetto: varnish: enable dynamic bans on one host per cluster in eqsin [puppet] - https://gerrit.wikimedia.org/r/769388 (https://phabricator.wikimedia.org/T302471)
[08:33:51] (PS3) Giuseppe Lavagetto: cache: turn on dynamic bans on all of eqsin [puppet] - https://gerrit.wikimedia.org/r/769389 (https://phabricator.wikimedia.org/T302471)
[08:33:53] (PS3) Giuseppe Lavagetto: cache: enable dynamic bans everywhere [puppet] - https://gerrit.wikimedia.org/r/769390 (https://phabricator.wikimedia.org/T302471)
[08:36:27] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1123', diff saved to https://phabricator.wikimedia.org/P22183 and previous config saved to /var/cache/conftool/dbconfig/20220309-083626-marostegui.json
[08:36:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:16] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34155/console" [puppet] - https://gerrit.wikimedia.org/r/769388 (https://phabricator.wikimedia.org/T302471) (owner: Giuseppe Lavagetto)
[08:39:13] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host dumpsdata1007.eqiad.wmnet
[08:39:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:12] (PS3) Ladsgroup: Wikibase dumps: Lower batch size (reduce run time) [puppet] - https://gerrit.wikimedia.org/r/768032 (https://phabricator.wikimedia.org/T300255) (owner: Hoo man)
[08:40:16] (CR) Ladsgroup: [C: +2] Wikibase dumps: Lower batch size (reduce run time) [puppet] - https://gerrit.wikimedia.org/r/768032 (https://phabricator.wikimedia.org/T300255) (owner: Hoo man)
[08:40:21] (CR) Ladsgroup: [V: +2 C: +2] Wikibase dumps: Lower batch size (reduce run time) [puppet] - https://gerrit.wikimedia.org/r/768032 (https://phabricator.wikimedia.org/T300255) (owner: Hoo man)
[08:43:19] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host dumpsdata1007.eqiad.wmnet
[08:43:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:31] (PS3) Marostegui: change_pr_page_T300380.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/769384 (https://phabricator.wikimedia.org/T300380)
[08:43:34] (PS1) Ayounsi: Redirect one of Microsoft's range to codfw [dns] - https://gerrit.wikimedia.org/r/769392 (https://phabricator.wikimedia.org/T282861)
[08:43:35] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host dumpsdata1007.eqiad.wmnet
[08:43:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:15] (CR) Ayounsi: "Based on https://w.wiki/4vwk" [dns] - https://gerrit.wikimedia.org/r/769392 (https://phabricator.wikimedia.org/T282861) (owner: Ayounsi)
[08:45:50] (CR) Giuseppe Lavagetto: [C: +1] Redirect one of Microsoft's range to codfw [dns] - https://gerrit.wikimedia.org/r/769392 (https://phabricator.wikimedia.org/T282861) (owner: Ayounsi)
[08:46:04] (CR) Ayounsi: [C: +2] Redirect one of Microsoft's range to codfw [dns] - https://gerrit.wikimedia.org/r/769392 (https://phabricator.wikimedia.org/T282861) (owner: Ayounsi)
[08:46:05] SRE, LDAP-Access-Requests: Grant Access to Superset for OTichonova - https://phabricator.wikimedia.org/T303376 (Peachey88)
[08:46:27] !log Redirect one of Microsoft's range to codfw - T282861
[08:46:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:44] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host dumpsdata1007.eqiad.wmnet
[08:49:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:47] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host dumpsdata1007.eqiad.wmnet
[08:49:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:03] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1123', diff saved to https://phabricator.wikimedia.org/P22184 and previous config saved to /var/cache/conftool/dbconfig/20220309-085201-marostegui.json
[08:52:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:44] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host dumpsdata1007.eqiad.wmnet
[08:53:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:53] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host dumpsdata1007.eqiad.wmnet
[08:53:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:52] (CR) MVernon: [C: +1] "Thanks - this looks right to me; can you reopen T294137 (or make a new ticket to track this) please?" [puppet] - https://gerrit.wikimedia.org/r/769382 (https://phabricator.wikimedia.org/T294137) (owner: Elukey)
[09:00:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dumpsdata1007.eqiad.wmnet
[09:00:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:37] (CR) Elukey: [C: +2] Add role::insetup for ms-fe1012 [puppet] - https://gerrit.wikimedia.org/r/769382 (https://phabricator.wikimedia.org/T294137) (owner: Elukey)
[09:06:47] SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ms-fe1009-1012 - https://phabricator.wikimedia.org/T294137 (elukey) Resolved→Open Hi Chris! There are a couple of issues with this task: 1) The new hosts were added to site.pp with https...
[09:07:38] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1123 (T298294)', diff saved to https://phabricator.wikimedia.org/P22186 and previous config saved to /var/cache/conftool/dbconfig/20220309-090737-marostegui.json
[09:07:40] !log marostegui@cumin2002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[09:07:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:42] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294
[09:07:42] !log marostegui@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[09:07:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:46] PROBLEM - SSH on analytics1063.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:11:12] (PS1) Ayounsi: Revert "Redirect one of Microsoft's range to codfw" [dns] - https://gerrit.wikimedia.org/r/769287
[09:15:04] (CR) Ayounsi: [C: +2] Revert "Redirect one of Microsoft's range to codfw" [dns] - https://gerrit.wikimedia.org/r/769287 (owner: Ayounsi)
[09:16:53] !log dbmaint on s4@eqiad T298295
[09:16:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:56] T298295: Fix length of columns page_restrictions.pr_level/pr_type on wmf wikis - https://phabricator.wikimedia.org/T298295
[09:17:20] (CR) Volans: "Using wmflib you don't need to re-implement the functionality." [puppet] - https://gerrit.wikimedia.org/r/769142 (https://phabricator.wikimedia.org/T303064) (owner: Cwhite)
[09:18:54] !log dbmaint on s1@eqiad T298295
[09:18:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:43] !log dbmaint on s2@eqiad T298295
[09:23:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:47] T298295: Fix length of columns page_restrictions.pr_level/pr_type on wmf wikis - https://phabricator.wikimedia.org/T298295
[09:26:55] !log marostegui@cumin2002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[09:26:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:57] !log marostegui@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[09:26:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:32] !log marostegui@cumin2002 dbctl commit (dc=all): 'Depooling db1166 (T298294)', diff saved to https://phabricator.wikimedia.org/P22187 and previous config saved to /var/cache/conftool/dbconfig/20220309-092731-marostegui.json
[09:27:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:36] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294
[09:30:42] !log marostegui@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[09:30:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:30:45] !log marostegui@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[09:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:20] !log marostegui@cumin2002 dbctl commit (dc=all): 'Depooling db1098:3317 (T300775)', diff saved to https://phabricator.wikimedia.org/P22188 and previous config saved to /var/cache/conftool/dbconfig/20220309-093119-marostegui.json
[09:31:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:23] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775
[09:34:07] (CR) DCausse: [C: +1] elasticsearch: upgrade relforge to elasticsearch 6.8 [puppet] - https://gerrit.wikimedia.org/r/763479 (https://phabricator.wikimedia.org/T301955) (owner: Gehel)
[09:45:02] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T298294)', diff saved to https://phabricator.wikimedia.org/P22189 and previous config saved to /var/cache/conftool/dbconfig/20220309-094501-marostegui.json
[09:45:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:45:06] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294
[09:45:10] !log dbmaint on s7@eqiad T298295
[09:45:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:45:13] T298295: Fix length of columns page_restrictions.pr_level/pr_type on wmf wikis - https://phabricator.wikimedia.org/T298295
[09:46:19] (PS1) Btullis: Move datahubsearch service from service_setup to lvs_setup [puppet] - https://gerrit.wikimedia.org/r/769398 (https://phabricator.wikimedia.org/T301458)
[09:47:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2147', diff saved to https://phabricator.wikimedia.org/P22190 and previous config saved to /var/cache/conftool/dbconfig/20220309-094704-marostegui.json
[09:47:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:48:04] (CR) Btullis: [V: +1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34157/console" [puppet] - https://gerrit.wikimedia.org/r/769398 (https://phabricator.wikimedia.org/T301458)
(owner: 10Btullis) [09:58:36] (03CR) 10Btullis: [V: 03+1] "I believe that this change is required in order to generate the /srv/config-master/pybal/eqiad/datahubsearch file that is being referenced" [puppet] - 10https://gerrit.wikimedia.org/r/769398 (https://phabricator.wikimedia.org/T301458) (owner: 10Btullis) [10:00:37] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P22191 and previous config saved to /var/cache/conftool/dbconfig/20220309-100036-marostegui.json [10:00:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:58] (03CR) 10Volans: "Nice! thanks for putting all the pieces from the task together." [puppet] - 10https://gerrit.wikimedia.org/r/769132 (https://phabricator.wikimedia.org/T270391) (owner: 10Jbond) [10:04:59] (03PS1) 10Ladsgroup: reenable DPL on nowikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/769400 [10:05:12] jouncebot: nowandnext [10:05:12] No deployments scheduled for the next 3 hour(s) and 54 minute(s) [10:05:12] In 3 hour(s) and 54 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220309T1400) [10:06:32] (03CR) 10Ladsgroup: [C: 03+2] reenable DPL on nowikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/769400 (owner: 10Ladsgroup) [10:07:14] (03Merged) 10jenkins-bot: reenable DPL on nowikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/769400 (owner: 10Ladsgroup) [10:08:46] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:769400|reenable DPL on nowikimedia]] (duration: 00m 51s) [10:08:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:07] (03CR) 10Volans: "Add a comment on nice to have protections" [puppet] - 10https://gerrit.wikimedia.org/r/769132 (https://phabricator.wikimedia.org/T270391) (owner: 10Jbond) [10:11:57] !log jmm@cumin2002 START - 
Cookbook sre.hosts.reboot-single for host sretest1001.eqiad.wmnet [10:11:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [10:12:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:51] (03CR) 10Volans: [C: 03+2] elasticsearch: load config from yaml (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/716532 (https://phabricator.wikimedia.org/T278378) (owner: 10Ryan Kemper) [10:12:54] (03PS22) 10Volans: elasticsearch: load config from yaml [software/spicerack] - 10https://gerrit.wikimedia.org/r/716532 (https://phabricator.wikimedia.org/T278378) (owner: 10Ryan Kemper) [10:13:50] RECOVERY - Check systemd state on sretest1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:16:11] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P22192 and previous config saved to /var/cache/conftool/dbconfig/20220309-101610-marostegui.json [10:16:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1001.eqiad.wmnet [10:17:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:44] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet [10:17:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [10:19:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [10:19:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:31] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet [10:21:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:30] PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:21:46] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "I'm voting -1 mostly because I find the usage of the `cloud` term conflicting, in particular the keyword `cloudnet`." [puppet] - 10https://gerrit.wikimedia.org/r/769132 (https://phabricator.wikimedia.org/T270391) (owner: 10Jbond) [10:25:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [10:25:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:05] !log dbmaint on s3@eqiad T298295 [10:29:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:08] T298295: Fix length of columns page_restrictions.pr_level/pr_type on wmf wikis - https://phabricator.wikimedia.org/T298295 [10:29:31] !log dbmaint on s6@eqiad T272512 [10:29:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:34] T272512: Apply outstanding schema changes for "objectcache" tables in production (exptime, flags, modtoken) - https://phabricator.wikimedia.org/T272512 [10:31:47] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T298294)', diff saved to https://phabricator.wikimedia.org/P22193 and previous config saved to /var/cache/conftool/dbconfig/20220309-103146-marostegui.json [10:31:49] !log marostegui@cumin2002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1175.eqiad.wmnet with reason: Maintenance [10:31:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:51] T298294: Make primary key filearchive.fa_id 
unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [10:31:51] !log marostegui@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1175.eqiad.wmnet with reason: Maintenance [10:31:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:27] !log marostegui@cumin2002 dbctl commit (dc=all): 'Depooling db1175 (T298294)', diff saved to https://phabricator.wikimedia.org/P22194 and previous config saved to /var/cache/conftool/dbconfig/20220309-103226-marostegui.json [10:32:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:57] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host people1003.eqiad.wmnet [10:34:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:49] (03CR) 10Ladsgroup: [C: 03+1] change_pr_page_T300380.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/769384 (https://phabricator.wikimedia.org/T300380) (owner: 10Marostegui) [10:38:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host people1003.eqiad.wmnet [10:38:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:34] !log btullis@cumin2002 START - Cookbook sre.hadoop.roll-restart-workers restart workers for Hadoop test cluster: Roll restart of jvm daemons for openjdk upgrade. 
[10:39:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:20] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp-test2001.wikimedia.org [10:40:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:57] (03CR) 10Marostegui: [C: 03+2] change_pr_page_T300380.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/769384 (https://phabricator.wikimedia.org/T300380) (owner: 10Marostegui) [10:41:20] (03Merged) 10jenkins-bot: change_pr_page_T300380.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/769384 (https://phabricator.wikimedia.org/T300380) (owner: 10Marostegui) [10:42:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp-test2001.wikimedia.org [10:42:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:34] (03PS1) 10Marostegui: change_pr_type_pr_level_T298295.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/769403 (https://phabricator.wikimedia.org/T298295) [10:43:49] (03PS4) 10Jbond: (WIP) C:varnish: Add automatic cloud nets update [puppet] - 10https://gerrit.wikimedia.org/r/769132 (https://phabricator.wikimedia.org/T270391) [10:44:12] (03PS1) 10Ladsgroup: labs: Set TemplateLinksSchemaMigrationStage to WRITE_BOTH [mediawiki-config] - 10https://gerrit.wikimedia.org/r/769404 (https://phabricator.wikimedia.org/T299420) [10:45:59] (03CR) 10Ladsgroup: [C: 03+2] labs: Set TemplateLinksSchemaMigrationStage to WRITE_BOTH [mediawiki-config] - 10https://gerrit.wikimedia.org/r/769404 (https://phabricator.wikimedia.org/T299420) (owner: 10Ladsgroup) [10:46:38] (03Merged) 10jenkins-bot: labs: Set TemplateLinksSchemaMigrationStage to WRITE_BOTH [mediawiki-config] - 10https://gerrit.wikimedia.org/r/769404 (https://phabricator.wikimedia.org/T299420) (owner: 10Ladsgroup) [10:50:10] (03PS5) 10Marostegui: mariadb: Promote db1107 to 
m5 master [puppet] - 10https://gerrit.wikimedia.org/r/768954 (https://phabricator.wikimedia.org/T302190) [10:51:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [10:51:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:40] !log btullis@cumin2002 END (PASS) - Cookbook sre.hadoop.roll-restart-workers (exit_code=0) restart workers for Hadoop test cluster: Roll restart of jvm daemons for openjdk upgrade. [10:51:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [10:55:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [10:55:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [10:55:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:35] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host cloudvirt1016.eqiad.wmnet [10:58:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:36] (03CR) 10Elukey: [C: 04-1] "partman recipe is not right" [puppet] - 10https://gerrit.wikimedia.org/r/769085 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [11:02:50] (03PS2) 10Awight: Template search improvements to all wikis except enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767510 (https://phabricator.wikimedia.org/T286990) (owner: 10WMDE-Fisch) [11:02:55] (03CR) 10Awight: [C: 03+2] Template search improvements to all wikis except enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767510 (https://phabricator.wikimedia.org/T286990) (owner: 10WMDE-Fisch) [11:03:38] (03Merged) 10jenkins-bot: Template search 
improvements to all wikis except enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767510 (https://phabricator.wikimedia.org/T286990) (owner: 10WMDE-Fisch) [11:05:13] Been a while since I've deployed, sorry for the deviation from protocol... [11:05:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [11:05:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:02] I'm merging and deploying these config settings, which were scheduled for this morning: https://wikitech.wikimedia.org/wiki/Deployments#Wednesday,_March_9 [11:07:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [11:07:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [11:07:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [11:08:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:34] RECOVERY - SSH on thumbor2004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:11:16] !log awight@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:767510|Template search improvements to all wikis except enwiki (T286990)]] (duration: 00m 51s) [11:11:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:20] T286990: Deploy template search improvements, back button+warning message, and delete button to all wikis (except enwiki) - https://phabricator.wikimedia.org/T286990 [11:11:34] (03PS2) 10Awight: VE template back and delete button on all wikis except enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767508 (https://phabricator.wikimedia.org/T286990) (owner: 
10WMDE-Fisch) [11:11:42] (03CR) 10Awight: [C: 03+2] "Deploying." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767508 (https://phabricator.wikimedia.org/T286990) (owner: 10WMDE-Fisch) [11:12:26] (03Merged) 10jenkins-bot: VE template back and delete button on all wikis except enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767508 (https://phabricator.wikimedia.org/T286990) (owner: 10WMDE-Fisch) [11:13:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [11:13:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:21] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host cumin1001.eqiad.wmnet with OS bullseye [11:13:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [11:14:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [11:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [11:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:05] !log awight@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:767508|VE template back and delete button on all wikis except enwiki (T286990)]] (duration: 00m 50s) [11:17:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:08] T286990: Deploy template search improvements, back button+warning message, and delete button to all wikis (except enwiki) - https://phabricator.wikimedia.org/T286990 [11:17:26] (03PS4) 10Awight: VE template expanded sidebar and inline descriptions on all wikis except enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767512 
(https://phabricator.wikimedia.org/T286991) (owner: 10WMDE-Fisch) [11:18:56] (03CR) 10Awight: [C: 03+2] "Deploying." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767512 (https://phabricator.wikimedia.org/T286991) (owner: 10WMDE-Fisch) [11:19:09] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T298294)', diff saved to https://phabricator.wikimedia.org/P22195 and previous config saved to /var/cache/conftool/dbconfig/20220309-111907-marostegui.json [11:19:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:12] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [11:19:40] (03Merged) 10jenkins-bot: VE template expanded sidebar and inline descriptions on all wikis except enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767512 (https://phabricator.wikimedia.org/T286991) (owner: 10WMDE-Fisch) [11:20:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [11:20:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [11:21:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [11:21:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [11:22:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:47] RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:26:09] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cumin1001.eqiad.wmnet with reason: host reimage [11:26:11] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:36] (03PS4) 10Ayounsi: Added optional ability to enable uRPF filtering on arbitary CR ints [homer/public] - 10https://gerrit.wikimedia.org/r/702446 (https://phabricator.wikimedia.org/T285461) (owner: 10Cathal Mooney) [11:27:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [11:27:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:48] (03CR) 10jerkins-bot: [V: 04-1] Added optional ability to enable uRPF filtering on arbitary CR ints [homer/public] - 10https://gerrit.wikimedia.org/r/702446 (https://phabricator.wikimedia.org/T285461) (owner: 10Cathal Mooney) [11:29:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [11:29:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [11:29:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cumin1001.eqiad.wmnet with reason: host reimage [11:29:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [11:29:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:21] (03PS2) 10Awight: Bracket matching on all wikis except enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767499 (https://phabricator.wikimedia.org/T280023) (owner: 10WMDE-Fisch) [11:32:29] !log awight@deploy1002 Synchronized wmf-config/: Config: [[gerrit:767512|VE template expanded sidebar and inline descriptions on all wikis except enwiki (T286991)]] (duration: 00m 51s) [11:32:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[11:32:32] (03CR) 10Awight: [C: 03+2] "Deploying." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767499 (https://phabricator.wikimedia.org/T280023) (owner: 10WMDE-Fisch) [11:32:32] T286991: Deploy inline descriptions, extended sidebar and bigger dialog to all wikis (except enwiki) - https://phabricator.wikimedia.org/T286991 [11:33:14] (03Merged) 10jenkins-bot: Bracket matching on all wikis except enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767499 (https://phabricator.wikimedia.org/T280023) (owner: 10WMDE-Fisch) [11:34:45] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P22197 and previous config saved to /var/cache/conftool/dbconfig/20220309-113442-marostegui.json [11:34:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:52] (03CR) 10DCausse: elastic: relax & restore perms during upgrade (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/769109 (https://phabricator.wikimedia.org/T301955) (owner: 10Ryan Kemper) [11:36:28] (03PS2) 10Awight: Syntax highlighting color scheme update on all wikis except enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767498 (https://phabricator.wikimedia.org/T280024) (owner: 10WMDE-Fisch) [11:36:35] (03CR) 10Awight: [C: 03+2] "Deploying." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/767498 (https://phabricator.wikimedia.org/T280024) (owner: 10WMDE-Fisch) [11:37:01] !log awight@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:767499|Bracket matching on all wikis except enwiki (T280023)]] (duration: 00m 49s) [11:37:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:04] T280023: Enable bracket matching on all wikis (except enwiki) - https://phabricator.wikimedia.org/T280023 [11:37:18] (03PS5) 10Jbond: (WIP) C:varnish: Add automatic cloud nets update [puppet] - 10https://gerrit.wikimedia.org/r/769132 (https://phabricator.wikimedia.org/T270391) [11:37:20] (03Merged) 10jenkins-bot: Syntax highlighting color scheme update on all wikis except enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767498 (https://phabricator.wikimedia.org/T280024) (owner: 10WMDE-Fisch) [11:38:04] PROBLEM - SSH on wtp1026.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:38:28] (03CR) 10jerkins-bot: [V: 04-1] (WIP) C:varnish: Add automatic cloud nets update [puppet] - 10https://gerrit.wikimedia.org/r/769132 (https://phabricator.wikimedia.org/T270391) (owner: 10Jbond) [11:39:28] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/datahubsearch on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:39:50] btullis: ^^ nice [11:40:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [11:40:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:33] vgutierrez: Thanks. Apologies that it took so long. 
[11:40:40] (03PS6) 10Jbond: (WIP) C:varnish: Add automatic cloud nets update [puppet] - 10https://gerrit.wikimedia.org/r/769132 (https://phabricator.wikimedia.org/T270391) [11:41:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [11:41:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [11:41:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:24] !log btullis@cumin2002 START - Cookbook sre.hadoop.roll-restart-masters restart masters for Hadoop test cluster: Restart of jvm daemons. [11:41:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:59] !log awight@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:767498|Syntax highlighting color scheme update on all wikis except enwiki (T280024)]] (duration: 00m 50s) [11:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:03] T280024: Enable syntax highlighting color scheme update on all wikis (except enwiki) - https://phabricator.wikimedia.org/T280024 [11:42:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [11:42:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:22] (03CR) 10Ladsgroup: [C: 03+1] change_pr_type_pr_level_T298295.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/769403 (https://phabricator.wikimedia.org/T298295) (owner: 10Marostegui) [11:42:36] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/datahubsearch on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:42:42] (03CR) 10Marostegui: [C: 03+2] change_pr_type_pr_level_T298295.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/769403 
(https://phabricator.wikimedia.org/T298295) (owner: 10Marostegui) [11:43:05] (03Merged) 10jenkins-bot: change_pr_type_pr_level_T298295.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/769403 (https://phabricator.wikimedia.org/T298295) (owner: 10Marostegui) [11:43:28] !log sketchy EU deployment complete. [11:43:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [11:47:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:20] (03CR) 10Btullis: [V: 03+1] Move datahubsearch service from service_setup to lvs_setup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/769398 (https://phabricator.wikimedia.org/T301458) (owner: 10Btullis) [11:48:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [11:48:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [11:48:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [11:49:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:20] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P22198 and previous config saved to /var/cache/conftool/dbconfig/20220309-115019-marostegui.json [11:50:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:12] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "honestly, the new naming you choose feels way better :-)" [puppet] - 10https://gerrit.wikimedia.org/r/769132 (https://phabricator.wikimedia.org/T270391) (owner: 10Jbond) [11:57:47] (03PS7) 10Jbond: (WIP) C:varnish: Add 
automatic cloud nets update [puppet] - 10https://gerrit.wikimedia.org/r/769132 (https://phabricator.wikimedia.org/T270391) [11:57:49] (03CR) 10Jbond: "thanks all, updated" [puppet] - 10https://gerrit.wikimedia.org/r/769132 (https://phabricator.wikimedia.org/T270391) (owner: 10Jbond) [11:58:24] (03CR) 10jerkins-bot: [V: 04-1] (WIP) C:varnish: Add automatic cloud nets update [puppet] - 10https://gerrit.wikimedia.org/r/769132 (https://phabricator.wikimedia.org/T270391) (owner: 10Jbond) [11:59:56] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db1107 to m5 master [puppet] - 10https://gerrit.wikimedia.org/r/768954 (https://phabricator.wikimedia.org/T302190) (owner: 10Marostegui) [12:03:09] (03PS8) 10Jbond: C:varnish: Add the external_cloud_vendors module to the cache clusters [puppet] - 10https://gerrit.wikimedia.org/r/769132 (https://phabricator.wikimedia.org/T270391) [12:03:11] (03PS1) 10Jbond: O:external_clouds_vendors: New module for fetching cloud networks [puppet] - 10https://gerrit.wikimedia.org/r/769410 (https://phabricator.wikimedia.org/T270391) [12:04:00] (03CR) 10Jbond: "i have split out the script downloading functionality to a separate module" [puppet] - 10https://gerrit.wikimedia.org/r/769410 (https://phabricator.wikimedia.org/T270391) (owner: 10Jbond) [12:04:08] (03CR) 10jerkins-bot: [V: 04-1] O:external_clouds_vendors: New module for fetching cloud networks [puppet] - 10https://gerrit.wikimedia.org/r/769410 (https://phabricator.wikimedia.org/T270391) (owner: 10Jbond) [12:05:56] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T298294)', diff saved to https://phabricator.wikimedia.org/P22199 and previous config saved to /var/cache/conftool/dbconfig/20220309-120554-marostegui.json [12:05:57] !log marostegui@cumin2002 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [12:05:57] (03PS2) 10Jbond: O:external_clouds_vendors: New module for fetching cloud networks [puppet] - 
10https://gerrit.wikimedia.org/r/769410 (https://phabricator.wikimedia.org/T270391) [12:05:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:00] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [12:06:00] !log marostegui@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [12:06:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:37] (03PS1) 10Muehlenhoff: python::venv: Create deploy-foo user using systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/769411 [12:06:49] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/769411 (owner: 10Muehlenhoff) [12:07:11] (03PS9) 10Jbond: C:varnish: Add the external_cloud_vendors module to the cache clusters [puppet] - 10https://gerrit.wikimedia.org/r/769132 (https://phabricator.wikimedia.org/T270391) [12:07:26] (03PS10) 10Jbond: C:varnish: Add the external_cloud_vendors module to the cache clusters [puppet] - 10https://gerrit.wikimedia.org/r/769132 (https://phabricator.wikimedia.org/T270391) [12:08:13] (03CR) 10Vgutierrez: O:external_clouds_vendors: New module for fetching cloud networks (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/769410 (https://phabricator.wikimedia.org/T270391) (owner: 10Jbond) [12:08:55] (03CR) 10Vgutierrez: O:external_clouds_vendors: New module for fetching cloud networks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/769410 (https://phabricator.wikimedia.org/T270391) (owner: 10Jbond) [12:13:30] (03PS21) 10Btullis: Add helm charts and a helmfile configuration for datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) [12:15:10] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 
10https://gerrit.wikimedia.org/r/769411 (owner: 10Muehlenhoff) [12:17:05] (03CR) 10Muehlenhoff: [C: 03+2] python::venv: Create deploy-foo user using systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/769411 (owner: 10Muehlenhoff) [12:18:08] (03CR) 10Marostegui: [C: 03+1] dbtools: Add db_maint_mapper_sal.py [software] - 10https://gerrit.wikimedia.org/r/768687 (owner: 10Ladsgroup) [12:19:21] (03CR) 10Ladsgroup: [C: 03+2] dbtools: Add db_maint_mapper_sal.py [software] - 10https://gerrit.wikimedia.org/r/768687 (owner: 10Ladsgroup) [12:20:00] (03Merged) 10jenkins-bot: dbtools: Add db_maint_mapper_sal.py [software] - 10https://gerrit.wikimedia.org/r/768687 (owner: 10Ladsgroup) [12:21:54] (03PS2) 10MSantos: WIP: introduce geoshapes service [deployment-charts] - 10https://gerrit.wikimedia.org/r/768678 [12:22:20] (03CR) 10Btullis: "helm-lint now passes." [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [12:23:15] (03CR) 10jerkins-bot: [V: 04-1] WIP: introduce geoshapes service [deployment-charts] - 10https://gerrit.wikimedia.org/r/768678 (owner: 10MSantos) [12:24:59] !log marostegui@cumin2002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1179.eqiad.wmnet with reason: Maintenance [12:25:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:02] !log marostegui@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1179.eqiad.wmnet with reason: Maintenance [12:25:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:37] !log marostegui@cumin2002 dbctl commit (dc=all): 'Depooling db1179 (T298294)', diff saved to https://phabricator.wikimedia.org/P22200 and previous config saved to /var/cache/conftool/dbconfig/20220309-122536-marostegui.json [12:25:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:40] T298294: Make primary key filearchive.fa_id unsigned on wmf 
wikis - https://phabricator.wikimedia.org/T298294 [12:25:56] (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [12:26:11] !log btullis@cumin2002 END (PASS) - Cookbook sre.hadoop.roll-restart-masters (exit_code=0) restart masters for Hadoop test cluster: Restart of jvm daemons. [12:26:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:04] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:27:23] (03CR) 10Vgutierrez: [C: 03+1] "You can proceed, first on lvs1020 and then if everything goes as expected, with lvs1019" [puppet] - 10https://gerrit.wikimedia.org/r/769398 (https://phabricator.wikimedia.org/T301458) (owner: 10Btullis) [12:28:20] (03PS1) 10Muehlenhoff: Create /etc/spicerack/elasticsearch/ [puppet] - 10https://gerrit.wikimedia.org/r/769421 [12:28:45] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/769421 (owner: 10Muehlenhoff) [12:29:00] (03CR) 10jerkins-bot: [V: 04-1] Create /etc/spicerack/elasticsearch/ [puppet] - 10https://gerrit.wikimedia.org/r/769421 (owner: 10Muehlenhoff) [12:29:02] (03CR) 10Vgutierrez: [V: 03+1 C: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34158/console" [puppet] - 10https://gerrit.wikimedia.org/r/769398 (https://phabricator.wikimedia.org/T301458) (owner: 10Btullis) [12:32:54] (03CR) 10Volans: "This patch broke Puppet on the cumin hosts because it's not creating the parent directory of the added file, that doesn't exists. 
Please a" [puppet] - 10https://gerrit.wikimedia.org/r/768816 (https://phabricator.wikimedia.org/T278378) (owner: 10Razzi) [12:33:15] (03PS2) 10Muehlenhoff: Create /etc/spicerack/elasticsearch/ [puppet] - 10https://gerrit.wikimedia.org/r/769421 [12:34:46] (03CR) 10Volans: [C: 03+1] "LGTM, thanks for the fix" [puppet] - 10https://gerrit.wikimedia.org/r/769421 (owner: 10Muehlenhoff) [12:36:58] (03CR) 10Muehlenhoff: [C: 03+2] Create /etc/spicerack/elasticsearch/ [puppet] - 10https://gerrit.wikimedia.org/r/769421 (owner: 10Muehlenhoff) [12:39:50] RECOVERY - SSH on wtp1026.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:46:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cumin1001.eqiad.wmnet with OS bullseye [12:46:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:36] (03PS1) 10Muehlenhoff: role::cluster_management: Remove support for buster [puppet] - 10https://gerrit.wikimedia.org/r/769426 [12:54:22] RECOVERY - Host ms-fe1012 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [12:55:50] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T298294)', diff saved to https://phabricator.wikimedia.org/P22201 and previous config saved to /var/cache/conftool/dbconfig/20220309-125549-marostegui.json [12:55:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:55] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [12:56:28] !log jmm@cumin1001 START - Cookbook sre.hosts.downtime for 0:10:00 on sretest[1001-1002].eqiad.wmnet with reason: just a test [12:56:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:30] !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on sretest[1001-1002].eqiad.wmnet with reason: just a test [12:56:31] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:08] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T300775)', diff saved to https://phabricator.wikimedia.org/P22202 and previous config saved to /var/cache/conftool/dbconfig/20220309-125907-marostegui.json [12:59:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:12] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [13:01:54] (03PS1) 10WMDE-Fisch: Fix missing padding on inline descriptions [extensions/VisualEditor] (wmf/1.38.0-wmf.24) - 10https://gerrit.wikimedia.org/r/769296 (https://phabricator.wikimedia.org/T303386) [13:02:12] (03PS1) 10WMDE-Fisch: Fix missing padding on inline descriptions [extensions/VisualEditor] (wmf/1.38.0-wmf.25) - 10https://gerrit.wikimedia.org/r/769297 (https://phabricator.wikimedia.org/T303386) [13:05:18] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe1009-1012 - https://phabricator.wikimedia.org/T294137 (10cmooney) @elukey thanks for looking at this. I am alarmed and not sure what was the cause of the network issues here. What seemed to be broken was that the... 
[13:09:35] (03PS5) 10Tchanders: Enable IPInfo on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767216 (https://phabricator.wikimedia.org/T260598) [13:09:37] (03PS4) 10Tchanders: Autopromote-once users to the 'ipinfo-viewer' group after one edit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767845 (https://phabricator.wikimedia.org/T296184) [13:10:21] (03CR) 10Tchanders: [C: 03+1] "Re-adding STran's +1 (only change since was a fix to the commit summary)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767845 (https://phabricator.wikimedia.org/T296184) (owner: 10Tchanders) [13:10:28] (03PS3) 10Jbond: O:external_clouds_vendors: New module for fetching cloud networks [puppet] - 10https://gerrit.wikimedia.org/r/769410 (https://phabricator.wikimedia.org/T270391) [13:10:55] (03PS1) 10Jcrespo: Check that xtrabackup --prepare is using the same version [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/769428 (https://phabricator.wikimedia.org/T253959) [13:11:08] (03CR) 10jerkins-bot: [V: 04-1] O:external_clouds_vendors: New module for fetching cloud networks [puppet] - 10https://gerrit.wikimedia.org/r/769410 (https://phabricator.wikimedia.org/T270391) (owner: 10Jbond) [13:11:25] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P22203 and previous config saved to /var/cache/conftool/dbconfig/20220309-131124-marostegui.json [13:11:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:43] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P22204 and previous config saved to /var/cache/conftool/dbconfig/20220309-131442-marostegui.json [13:14:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:41] (03PS4) 10Jbond: O:external_clouds_vendors: New module for fetching cloud networks [puppet] - 
10https://gerrit.wikimedia.org/r/769410 (https://phabricator.wikimedia.org/T270391) [13:16:17] (03CR) 10Jbond: "updated thanks" [puppet] - 10https://gerrit.wikimedia.org/r/769410 (https://phabricator.wikimedia.org/T270391) (owner: 10Jbond) [13:16:51] (03PS2) 10Jcrespo: Check that xtrabackup --prepare is using the same version [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/769428 (https://phabricator.wikimedia.org/T253959) [13:17:21] (03CR) 10Jbond: O:external_clouds_vendors: New module for fetching cloud networks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/769410 (https://phabricator.wikimedia.org/T270391) (owner: 10Jbond) [13:18:03] (03PS3) 10Jcrespo: Check that xtrabackup --prepare is using the same version [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/769428 (https://phabricator.wikimedia.org/T253959) [13:20:47] (03CR) 10Jbond: O:external_clouds_vendors: New module for fetching cloud networks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/769410 (https://phabricator.wikimedia.org/T270391) (owner: 10Jbond) [13:27:01] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P22205 and previous config saved to /var/cache/conftool/dbconfig/20220309-132700-marostegui.json [13:27:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:18] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P22206 and previous config saved to /var/cache/conftool/dbconfig/20220309-133017-marostegui.json [13:30:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:38] (03PS1) 10Muehlenhoff: Fix definition of /srv/pwstore [puppet] - 10https://gerrit.wikimedia.org/r/769430 [13:35:59] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/763557 (https://phabricator.wikimedia.org/T302471) (owner: 
10Giuseppe Lavagetto) [13:38:08] (03CR) 10Jbond: [V: 03+1 C: 03+1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34160/console" [puppet] - 10https://gerrit.wikimedia.org/r/763557 (https://phabricator.wikimedia.org/T302471) (owner: 10Giuseppe Lavagetto) [13:42:36] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T298294)', diff saved to https://phabricator.wikimedia.org/P22207 and previous config saved to /var/cache/conftool/dbconfig/20220309-134235-marostegui.json [13:42:37] !log marostegui@cumin2002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1102.eqiad.wmnet with reason: Maintenance [13:42:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:40] !log marostegui@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1102.eqiad.wmnet with reason: Maintenance [13:42:40] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [13:42:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:23] (03PS1) 10Vgutierrez: varnish: Rate limit public_cloud_nets on upload [puppet] - 10https://gerrit.wikimedia.org/r/769432 (https://phabricator.wikimedia.org/T282861) [13:44:30] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34161/console" [puppet] - 10https://gerrit.wikimedia.org/r/769398 (https://phabricator.wikimedia.org/T301458) (owner: 10Btullis) [13:44:50] (03CR) 10Btullis: [V: 03+1 C: 03+2] Move datahubsearch service from service_setup to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/769398 (https://phabricator.wikimedia.org/T301458) (owner: 10Btullis) [13:45:53] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T300775)', 
diff saved to https://phabricator.wikimedia.org/P22208 and previous config saved to /var/cache/conftool/dbconfig/20220309-134552-marostegui.json [13:45:55] !log marostegui@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db1101.eqiad.wmnet with reason: Maintenance [13:45:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:56] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [13:45:57] !log marostegui@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db1101.eqiad.wmnet with reason: Maintenance [13:45:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:15] (03PS1) 10Zabe: wmf.24 HACK: Add forward class alias for Gadget [extensions/Gadgets] (wmf/1.38.0-wmf.24) - 10https://gerrit.wikimedia.org/r/769433 (https://phabricator.wikimedia.org/T303391) [13:46:32] !log marostegui@cumin2002 dbctl commit (dc=all): 'Depooling db1101:3317 (T300775)', diff saved to https://phabricator.wikimedia.org/P22209 and previous config saved to /var/cache/conftool/dbconfig/20220309-134631-marostegui.json [13:46:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:03] !log dbmaint on s8@eqiad T272512 [13:47:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:07] T272512: Apply outstanding schema changes for "objectcache" tables in production (exptime, flags, modtoken) - https://phabricator.wikimedia.org/T272512 [13:50:59] !log restarting pybal on lvs102 T301458 [13:51:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:02] T301458: Define LVS load-balancing for OpenSearch cluster - https://phabricator.wikimedia.org/T301458 [13:51:03] (03CR) 10Vgutierrez: O:external_clouds_vendors: New module for fetching cloud networks (031 comment) [puppet] 
- 10https://gerrit.wikimedia.org/r/769410 (https://phabricator.wikimedia.org/T270391) (owner: 10Jbond) [13:53:34] (03PS4) 10Btullis: Add a record for datahubsearch service [dns] - 10https://gerrit.wikimedia.org/r/768663 (https://phabricator.wikimedia.org/T301458) [13:54:02] (03CR) 10Btullis: [C: 03+2] Add a record for datahubsearch service [dns] - 10https://gerrit.wikimedia.org/r/768663 (https://phabricator.wikimedia.org/T301458) (owner: 10Btullis) [13:54:26] PROBLEM - PyBal connections to etcd on lvs1019 is CRITICAL: CRITICAL: 71 connections established with conf1004.eqiad.wmnet:4001 (min=72) https://wikitech.wikimedia.org/wiki/PyBal [13:55:09] ^^ expected [13:56:51] (03CR) 10Ayounsi: [C: 03+1] "I don't know VCL enough for a code review but +1 on rate limiting public clouds toward upload" [puppet] - 10https://gerrit.wikimedia.org/r/769432 (https://phabricator.wikimedia.org/T282861) (owner: 10Vgutierrez) [13:57:37] vgutierrez: Thanks. I haven't restarted lvs1019 yet though. Still expected? [13:57:48] yes, that's why it's expected [13:57:48] ;P [13:57:58] puppet ran there, so the icinga check has been updated [13:58:10] but pybal hasn't got the new config loaded till you restart it [13:58:36] PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.71:9200]) https://wikitech.wikimedia.org/wiki/PyBal [13:58:45] OK, thanks. [13:58:53] that's also expected :) [13:58:56] (same reason) [13:59:11] !log restarting pybal on lvs1019 T301458 [13:59:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:17] T301458: Define LVS load-balancing for OpenSearch cluster - https://phabricator.wikimedia.org/T301458 [14:00:01] legoktm Amir1 andrewbogott bd808 around for m5 switchover? [14:00:05] RoanKattouw, Lucas_WMDE, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window . 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220309T1400). [14:00:05] Tchanders, awight, and zabe: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:20] I’m in a meeting and probably can’t deploy, sorry [14:01:07] marostegui: o/ [14:01:09] * andrewbogott is ready [14:01:13] do mw-deploys and the m5 switchover conflict? [14:01:13] !log marostegui@cumin2002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1112.eqiad.wmnet with reason: Maintenance [14:01:14] \o/ [14:01:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:16] !log marostegui@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1112.eqiad.wmnet with reason: Maintenance [14:01:17] !log marostegui@cumin2002 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [14:01:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:19] zabe: no [14:01:24] !log marostegui@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [14:01:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:33] !log Failover m5 from db1132 to db1107 - T302190 [14:01:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:36] T302190: Switchover m5 master (db1132 -> db1107) - https://phabricator.wikimedia.org/T302190 [14:01:52] I can't deploy either -- but i can watch another deployer and help if needed. 
[14:01:56] I can deploy if needed [14:01:59] !log marostegui@cumin2002 dbctl commit (dc=all): 'Depooling db1112 (T298294)', diff saved to https://phabricator.wikimedia.org/P22210 and previous config saved to /var/cache/conftool/dbconfig/20220309-140158-marostegui.json [14:02:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:02] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [14:02:10] legoktm Amir1 andrewbogott bd808 all done [14:02:25] I'm here now [14:02:40] taavi: I'd appreciate that. @Tchanders is also a deployer and might want to self serve? [14:02:45] legoktm Amir1 andrewbogott bd808 let's check services [14:02:50] urbanecm: ack [14:02:54] wikitech seems ok, I just edited my home page [14:02:55] Tchanders, awight: around? [14:02:57] Mailman seems up, sending a test email now [14:02:57] Toolhub looks fine [14:03:04] andrewbogott: wikitech is no longer on m5! [14:03:14] (03CR) 10Volans: [C: 03+1] "LGTM, thanks for the cleanup" [puppet] - 10https://gerrit.wikimedia.org/r/769426 (owner: 10Muehlenhoff) [14:03:14] you're right! Well, it seems ok anyway :) [14:03:27] Striker looks fine [14:03:29] taavi: Would you be able to deploy my patches possibly? I'm newly back from leave and it has been a while! [14:03:33] * andrewbogott just woke up and can't remember what we use m5 for anymore [14:03:38] striker I guess [14:03:43] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe1009-1012 - https://phabricator.wikimedia.org/T294137 (10MatthewVernon) @cmooney thanks for the update. To be clear, do you think I'm OK to put this one back into swift::proxy now? I might then procrastinate actual... [14:03:51] striker and maintain-dbusers I think [14:03:56] Tchanders: sure, no worries [14:03:58] taavi: Yes, and happy to deploy my patch after Tchanders's. 
[14:04:01] yeah, labsdbaccounts is maintain-dbusers [14:04:07] andrewbogott: you've got the affected services at https://phabricator.wikimedia.org/T302190 [14:04:14] awight: ack, thanks [14:04:15] at start of the task [14:04:16] taavi: Thanks! [14:04:24] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/769430 (owner: 10Muehlenhoff) [14:04:35] marostegui: I see! thanks [14:05:27] (03PS1) 10Volans: sre.SREBatchBaseRunner: fix puppet runs [cookbooks] - 10https://gerrit.wikimedia.org/r/769435 [14:05:29] (03PS1) 10Volans: sre.SRELBBatchRunnerBase: improve service restarts [cookbooks] - 10https://gerrit.wikimedia.org/r/769436 [14:05:31] (03PS1) 10Volans: sre.SREBatchRunnerBase: use alerting_hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/769437 [14:05:33] (03PS1) 10Volans: sre.cdn.roll-restart-varnish: add a new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/769438 [14:05:43] I presume awight's backports will take a while for Jenkins to process them, so I'm starting with Tchanders's config patches [14:05:49] On Striker I just got 'Error updating database. [req id: dab306c16711468c935f2730d8931883]' [14:05:56] bd808: did you try a write? [14:06:02] (feel free to +2 the backports already, just don't pull them to deploy1002 yet please) [14:06:21] andrewbogott: hmmm... I did not. I assumed that a read would be enough [14:06:26] andrewbogott: fixing that [14:06:35] andrewbogott: try again [14:06:48] marostegui: looks good now [14:06:52] \o/ [14:07:04] legoktm: ^ this might have caused mailman to fail as well, if you can double check [14:08:08] RECOVERY - PyBal connections to etcd on lvs1019 is OK: OK: 72 connections established with conf1004.eqiad.wmnet:4001 (min=72) https://wikitech.wikimedia.org/wiki/PyBal [14:08:12] marostegui: seems like it, but my emails have gone through now, so it all looks good! [14:08:20] sweeet [14:08:33] bd808 andrewbogott all good from your side? 
[14:08:34] Tchanders: since you're enabling an extension for the first time on production, I'm just double checking everything needed for that has been done - will take a moment, sorry [14:08:43] marostegui: yep, lgtm [14:08:52] excellent, I will wrap up and close the task then [14:08:56] thank you all very much [14:09:04] taavi: Thanks - any questions let me know [14:10:46] thx marostegui [14:11:16] ^^ [14:11:37] Tchanders: I see the extension creates a database table called 'ipinfo_ip_changes' but that does not exist on testwiki - expected? [14:11:41] thanks marostegui, and thanks for accommodating my timezone [14:12:12] RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:12:24] bd808: my pleasure, and sorry for not thinking about it in the first place. I had the autopilot set from the other mX sections where we have EU people as service owners! [14:12:41] taavi: Good point. No it's not used yet https://codesearch.wmcloud.org/search/?q=ipinfo_ip_changes&i=nope&files=&excludeFiles=&repos= [14:13:19] taavi: yes, we are on it [14:13:33] Sorry I missed the window, I was afk for lunch and forgot [14:13:49] additionally I don't see a beta feature review anywhere on phabricator (required according to https://www.mediawiki.org/wiki/Writing_an_extension_for_deployment#Preparing_for_deployment) [14:14:17] (03PS5) 10Jbond: O:external_clouds_vendors: New module for fetching cloud networks [puppet] - 10https://gerrit.wikimedia.org/r/769410 (https://phabricator.wikimedia.org/T270391) [14:14:26] (03PS11) 10Jbond: C:varnish: Add the external_cloud_vendors module to the cache clusters [puppet] - 10https://gerrit.wikimedia.org/r/769132 (https://phabricator.wikimedia.org/T270391) [14:14:34] taavi: that's outdated, it's not really required [14:15:07] I'm not comfortable doing something that [[mw:Writing_an_extension_for_deployment]] explicitly says is not allowed [14:15:25] if it's not needed, then 
the docs should be updated [14:16:01] (03PS1) 10Marostegui: db1132: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/769442 [14:16:38] (03CR) 10Jbond: O:external_clouds_vendors: New module for fetching cloud networks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/769410 (https://phabricator.wikimedia.org/T270391) (owner: 10Jbond) [14:16:59] (03PS12) 10Jbond: C:varnish: Add the external_cloud_vendors module to the cache clusters [puppet] - 10https://gerrit.wikimedia.org/r/769132 (https://phabricator.wikimedia.org/T270391) [14:17:14] (03CR) 10Volans: C:varnish: Add the external_cloud_vendors module to the cache clusters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/769132 (https://phabricator.wikimedia.org/T270391) (owner: 10Jbond) [14:17:17] (03PS1) 10MVernon: site: ms-fe1012 no longer insetup [puppet] - 10https://gerrit.wikimedia.org/r/769443 (https://phabricator.wikimedia.org/T294137) [14:17:35] taavi: it's a wiki [14:17:50] (03CR) 10Elukey: [C: 03+1] site: ms-fe1012 no longer insetup [puppet] - 10https://gerrit.wikimedia.org/r/769443 (https://phabricator.wikimedia.org/T294137) (owner: 10MVernon) [14:17:52] (03CR) 10jerkins-bot: [V: 04-1] C:varnish: Add the external_cloud_vendors module to the cache clusters [puppet] - 10https://gerrit.wikimedia.org/r/769132 (https://phabricator.wikimedia.org/T270391) (owner: 10Jbond) [14:18:12] taavi: Fair enough. We did speak to James Forrester but can't see a public conversation other than his presence on this task: https://phabricator.wikimedia.org/T292802. We could get a public OK and schedule for a later window if you're not comfortable... 
[14:18:25] (03CR) 10MVernon: [C: 03+2] site: ms-fe1012 no longer insetup [puppet] - 10https://gerrit.wikimedia.org/r/769443 (https://phabricator.wikimedia.org/T294137) (owner: 10MVernon) [14:18:38] (03CR) 10Marostegui: [C: 03+2] db1132: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/769442 (owner: 10Marostegui) [14:18:39] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T298294)', diff saved to https://phabricator.wikimedia.org/P22211 and previous config saved to /var/cache/conftool/dbconfig/20220309-141837-marostegui.json [14:18:40] Amir1: yes, but I don't want to be the one responsible for updating policy regarding production deployments [14:18:40] (03CR) 10Jbond: C:varnish: Add the external_cloud_vendors module to the cache clusters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/769132 (https://phabricator.wikimedia.org/T270391) (owner: 10Jbond) [14:18:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:42] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [14:18:47] (03PS13) 10Jbond: C:varnish: Add the external_cloud_vendors module to the cache clusters [puppet] - 10https://gerrit.wikimedia.org/r/769132 (https://phabricator.wikimedia.org/T270391) [14:19:06] Emperor: I have merged your changes, thought they were mine! [14:19:22] Tchanders: I think that would be the best option, sorry about that :-( [14:19:35] do you still want me to deploy the beta cluster only change? 
[14:20:06] marostegui: I was just scratching my head as to why puppet-merge was only offering me one of your changes :-) [14:20:09] (03CR) 10Jbond: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/769435 (owner: 10Volans) [14:20:11] haha [14:20:17] Emperor: so please merge mine :) [14:20:25] Tchanders: I think taavi is asking for a beta feature [14:20:25] done :) [14:20:26] team work [14:20:28] not beta cluster [14:20:29] thanks! [14:20:37] Am I misunderstanding [14:21:01] (03CR) 10Muehlenhoff: [C: 03+2] Fix definition of /srv/pwstore [puppet] - 10https://gerrit.wikimedia.org/r/769430 (owner: 10Muehlenhoff) [14:21:33] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/766882 affects the beta cluster only (since $wmgUseIPInfo is true only there), https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/767216 and the autopromote patches touch production which according to my understanding of the policy is blocked on a beta feature review [14:21:45] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/769436 (owner: 10Volans) [14:21:58] taavi: No problem, thanks for offering! Are you asking if we could still do https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/766882 ? 
Would be great if so [14:22:21] (03PS6) 10Majavah: Add IPInfo viewing rights for certain groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766882 (https://phabricator.wikimedia.org/T296499) (owner: 10STran) [14:22:55] (03CR) 10CDanis: [C: 03+1] R:varnish:instance: Add general public cloud rate limiting [puppet] - 10https://gerrit.wikimedia.org/r/740818 (https://phabricator.wikimedia.org/T224891) (owner: 10Jbond) [14:23:10] (03CR) 10Jbond: "LGTM minor comment question" [cookbooks] - 10https://gerrit.wikimedia.org/r/769437 (owner: 10Volans) [14:23:23] Amir1: I think we're talking about a review of the beta feature (as in Extension:BetaFeatures) [14:23:34] (03CR) 10Majavah: [C: 03+2] "deploying" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766882 (https://phabricator.wikimedia.org/T296499) (owner: 10STran) [14:23:43] yes, I don't think that's mandatory anymore [14:24:15] (03Merged) 10jenkins-bot: Add IPInfo viewing rights for certain groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766882 (https://phabricator.wikimedia.org/T296499) (owner: 10STran) [14:24:22] (03CR) 10Volans: "reply inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/769437 (owner: 10Volans) [14:24:48] it depends on the feature, ofc if it's something massive, it must be a beta feature but I have seen many new extensions being deployed without going through beta feature phase [14:25:15] some even go with a different path, e.g. 
A/B testing (growth features for example) [14:26:07] Tchanders: hope that clears it up ^ [14:26:10] Tchanders: ok, merged so it should get deployed to beta within the next 30 mins (feel free to ping me if it does not), also syncing the file on production to avoid any surprises even though it's a no-op there [14:26:35] taavi: Thanks [14:27:26] !log taavi@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:766882|Add IPInfo viewing rights for certain groups (T296499)]] (no-op on prod) (duration: 00m 50s) [14:27:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:30] T296499: Grant certain groups the ipinfo-view-full right - https://phabricator.wikimedia.org/T296499 [14:27:37] done, again sorry for the trouble [14:28:27] (03CR) 10Majavah: [C: 03+2] Fix missing padding on inline descriptions [extensions/VisualEditor] (wmf/1.38.0-wmf.24) - 10https://gerrit.wikimedia.org/r/769296 (https://phabricator.wikimedia.org/T303386) (owner: 10WMDE-Fisch) [14:28:37] (03CR) 10Majavah: [C: 03+2] Fix missing padding on inline descriptions [extensions/VisualEditor] (wmf/1.38.0-wmf.25) - 10https://gerrit.wikimedia.org/r/769297 (https://phabricator.wikimedia.org/T303386) (owner: 10WMDE-Fisch) [14:28:54] Amir1: Thanks for the advice. We have run it by James, but I'll just ask him to comment on our task for the sake of the deployer. Sounds like it's worth doing until the documentation is updated - I understand why you'd want to abide by what it says for now [14:29:10] (03CR) 10Jbond: [C: 03+1] "LGTM, see comment/warning" [cookbooks] - 10https://gerrit.wikimedia.org/r/769438 (owner: 10Volans) [14:29:13] the doc says "A beta feature review, if your extension adds a beta feature.". Are you adding a beta feature? [14:29:21] taavi: Thanks for helping us out, and sorry it was larger than it might have looked! 
[14:29:24] Amir1: Yeah [14:29:30] I see [14:29:44] I thought you need a "beta feature" for it [14:30:03] Amir1: Ah I see the confusion [14:30:16] that makes sense, James is great, let me know if I can help on anything [14:30:25] Thanks! [14:31:31] (03CR) 10Majavah: [C: 03+2] wmf.24 HACK: Add forward class alias for Gadget [extensions/Gadgets] (wmf/1.38.0-wmf.24) - 10https://gerrit.wikimedia.org/r/769433 (https://phabricator.wikimedia.org/T303391) (owner: 10Zabe) [14:31:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:31:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:09] ok, awight and zabe: your patches are now waiting for ci [14:32:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:33:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:33:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:33:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:14] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P22212 and previous config saved to /var/cache/conftool/dbconfig/20220309-143413-marostegui.json [14:34:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:45] 10SRE, 10Data-Catalog, 10Data-Engineering, 10serviceops, 10Service-deployment-requests: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (10BTullis) The helm charts and helmfile deployment are now passing the CI `helm-lint` stage. [14:39:38] (03CR) 10Volans: "LGTM apart a small bug. I didn't check PCC." 
[puppet] - 10https://gerrit.wikimedia.org/r/769410 (https://phabricator.wikimedia.org/T270391) (owner: 10Jbond) [14:40:24] (03CR) 10Volans: [C: 03+2] sre.SREBatchBaseRunner: fix puppet runs [cookbooks] - 10https://gerrit.wikimedia.org/r/769435 (owner: 10Volans) [14:41:10] and of course one selenium job failed, causing the ci progress for the rest to start from 0 :-( [14:41:18] (03CR) 10Volans: [C: 03+2] sre.SRELBBatchRunnerBase: improve service restarts [cookbooks] - 10https://gerrit.wikimedia.org/r/769436 (owner: 10Volans) [14:41:38] awight: the failures in https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php72-docker/139901/console look unrelated, right? [14:42:42] (03CR) 10Volans: "reply inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/769438 (owner: 10Volans) [14:43:02] (03Merged) 10jenkins-bot: sre.SREBatchBaseRunner: fix puppet runs [cookbooks] - 10https://gerrit.wikimedia.org/r/769435 (owner: 10Volans) [14:43:05] taavi: right, this was just CSS in a dialog that wasn't touched by the test. 
And, > Error: connect ECONNREFUSED 127.0.0.1:38883 [14:43:55] (03Merged) 10jenkins-bot: sre.SRELBBatchRunnerBase: improve service restarts [cookbooks] - 10https://gerrit.wikimedia.org/r/769436 (owner: 10Volans) [14:44:05] (03CR) 10jerkins-bot: [V: 04-1] Fix missing padding on inline descriptions [extensions/VisualEditor] (wmf/1.38.0-wmf.24) - 10https://gerrit.wikimedia.org/r/769296 (https://phabricator.wikimedia.org/T303386) (owner: 10WMDE-Fisch) [14:44:33] * taavi retries it [14:44:45] (03CR) 10Majavah: [C: 03+2] "retrying, ci failure looks unrelated" [extensions/VisualEditor] (wmf/1.38.0-wmf.24) - 10https://gerrit.wikimedia.org/r/769296 (https://phabricator.wikimedia.org/T303386) (owner: 10WMDE-Fisch) [14:45:33] (03PS2) 10Elukey: Set Bullseye + overlayfs settings for kubernetes2007 [puppet] - 10https://gerrit.wikimedia.org/r/769085 (https://phabricator.wikimedia.org/T300744) [14:47:32] (03CR) 10JMeybohm: [C: 03+1] Set Bullseye + overlayfs settings for kubernetes2007 [puppet] - 10https://gerrit.wikimedia.org/r/769085 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [14:48:24] (03PS1) 10Jbond: puppetmaster::gitclone: small clean up patch [puppet] - 10https://gerrit.wikimedia.org/r/769445 [14:49:45] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34162/console" [puppet] - 10https://gerrit.wikimedia.org/r/769445 (owner: 10Jbond) [14:49:49] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P22213 and previous config saved to /var/cache/conftool/dbconfig/20220309-144948-marostegui.json [14:49:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:09] (03PS1) 10Volans: CHANGELOG: add changelogs for release v2.3.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/769446 [14:50:45] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v2.3.0 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/769446 (owner: 10Volans) [14:52:57] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34163/console" [puppet] - 10https://gerrit.wikimedia.org/r/769085 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [14:53:17] (03PS6) 10Jbond: O:external_clouds_vendors: New module for fetching cloud networks [puppet] - 10https://gerrit.wikimedia.org/r/769410 (https://phabricator.wikimedia.org/T270391) [14:54:03] (03CR) 10jerkins-bot: [V: 04-1] O:external_clouds_vendors: New module for fetching cloud networks [puppet] - 10https://gerrit.wikimedia.org/r/769410 (https://phabricator.wikimedia.org/T270391) (owner: 10Jbond) [14:54:17] !log volans@cumin1001 START - Cookbook sre.deploy.python-code homer to cumin1001.eqiad.wmnet with reason: Release v0.4.0 to reimaged cumin1001 - volans@cumin1001 [14:54:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:30] (03Merged) 10jenkins-bot: Fix missing padding on inline descriptions [extensions/VisualEditor] (wmf/1.38.0-wmf.25) - 10https://gerrit.wikimedia.org/r/769297 (https://phabricator.wikimedia.org/T303386) (owner: 10WMDE-Fisch) [14:54:45] (03Merged) 10jenkins-bot: wmf.24 HACK: Add forward class alias for Gadget [extensions/Gadgets] (wmf/1.38.0-wmf.24) - 10https://gerrit.wikimedia.org/r/769433 (https://phabricator.wikimedia.org/T303391) (owner: 10Zabe) [14:54:59] finally [14:55:10] !log volans@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin1001.eqiad.wmnet with reason: Release v0.4.0 to reimaged cumin1001 - volans@cumin1001 [14:55:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:33] zabe: anything to test with your patch? [14:55:47] no, except watching logstash [14:56:20] ok! 
syncing then [14:56:58] (03PS1) 10Jbond: O:external_clouds_vendors: New module for fetching cloud networks [puppet] - 10https://gerrit.wikimedia.org/r/769447 (https://phabricator.wikimedia.org/T270391) [14:57:13] (03CR) 10Elukey: [V: 03+1 C: 03+2] Set Bullseye + overlayfs settings for kubernetes2007 [puppet] - 10https://gerrit.wikimedia.org/r/769085 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [14:57:21] !log taavi@deploy1002 Synchronized php-1.38.0-wmf.24/extensions/Gadgets/includes: Backport: [[gerrit:769433|wmf.24 HACK: Add forward class alias for Gadget (T303391)]] (1/2) (duration: 00m 50s) [14:57:23] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v2.3.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/769446 (owner: 10Volans) [14:57:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:25] T303391: Flag-day change (cached values incompatible) in Gadgets extension brought translatewiki.net down - https://phabricator.wikimedia.org/T303391 [14:57:35] 10SRE-Access-Requests: [WIP] Requesting access to deployment group for TThoabala - https://phabricator.wikimedia.org/T303398 (10Tchanders) [14:57:52] (03CR) 10Vgutierrez: R:varnish:instance: Add general public cloud rate limiting (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/740818 (https://phabricator.wikimedia.org/T224891) (owner: 10Jbond) [14:58:18] (03CR) 10jerkins-bot: [V: 04-1] O:external_clouds_vendors: New module for fetching cloud networks [puppet] - 10https://gerrit.wikimedia.org/r/769447 (https://phabricator.wikimedia.org/T270391) (owner: 10Jbond) [14:58:26] !log taavi@deploy1002 Synchronized php-1.38.0-wmf.24/extensions/Gadgets/extension.json: Backport: [[gerrit:769433|wmf.24 HACK: Add forward class alias for Gadget (T303391)]] (2/2) (duration: 00m 49s) [14:58:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:42] taavi: lmk when I should begin [14:59:09] !log mwdebug-deploy@deploy1002 
helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:59:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:31] zabe: looks good https://phabricator.wikimedia.org/P22214 [14:59:32] thanks! [14:59:35] (03PS1) 10Jbond: O:external_clouds_vendors: New module for fetching cloud networks [puppet] - 10https://gerrit.wikimedia.org/r/769448 (https://phabricator.wikimedia.org/T270391) [14:59:47] awight: I'm done with the rest of the patches, although one of your patches is in CI [14:59:49] (03PS1) 10Volans: Upstream release v2.3.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/769449 [14:59:55] thanks, I'll start then! [14:59:56] (03Merged) 10jenkins-bot: Fix missing padding on inline descriptions [extensions/VisualEditor] (wmf/1.38.0-wmf.24) - 10https://gerrit.wikimedia.org/r/769296 (https://phabricator.wikimedia.org/T303386) (owner: 10WMDE-Fisch) [15:00:03] sure, thanks and sorry for the wait! [15:00:07] (03CR) 10jerkins-bot: [V: 04-1] O:external_clouds_vendors: New module for fetching cloud networks [puppet] - 10https://gerrit.wikimedia.org/r/769448 (https://phabricator.wikimedia.org/T270391) (owner: 10Jbond) [15:00:11] (perfect timing with that patch :D) [15:00:19] :-)! 
[15:00:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:00:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [15:00:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:50] (03PS1) 10Ladsgroup: mediawiki: Update some of education.wikimedia.org redirects [puppet] - 10https://gerrit.wikimedia.org/r/769450 (https://phabricator.wikimedia.org/T303397) [15:01:01] (03PS7) 10Jbond: O:external_clouds_vendors: New module for fetching cloud networks [puppet] - 10https://gerrit.wikimedia.org/r/769410 (https://phabricator.wikimedia.org/T270391) [15:01:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:01:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:17] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2007.codfw.wmnet with OS bullseye [15:01:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:39] (03CR) 10jerkins-bot: [V: 04-1] O:external_clouds_vendors: New module for fetching cloud networks [puppet] - 10https://gerrit.wikimedia.org/r/769410 (https://phabricator.wikimedia.org/T270391) (owner: 10Jbond) [15:03:11] (03PS14) 10Giuseppe Lavagetto: varnish/frontend: consume etcd data for dynamic banning of requests. 
[puppet] - 10https://gerrit.wikimedia.org/r/763557 (https://phabricator.wikimedia.org/T302471) [15:03:13] (03PS4) 10Giuseppe Lavagetto: varnish: enable dynamic bans on one host per cluster in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/769388 (https://phabricator.wikimedia.org/T302471) [15:03:15] (03PS4) 10Giuseppe Lavagetto: cache: turn on dynamic bans on all of eqsin [puppet] - 10https://gerrit.wikimedia.org/r/769389 (https://phabricator.wikimedia.org/T302471) [15:03:17] (03PS4) 10Giuseppe Lavagetto: cache: enable dynamic bans everywhere [puppet] - 10https://gerrit.wikimedia.org/r/769390 (https://phabricator.wikimedia.org/T302471) [15:03:27] (03PS8) 10Jbond: O:external_clouds_vendors: New module for fetching cloud networks [puppet] - 10https://gerrit.wikimedia.org/r/769410 (https://phabricator.wikimedia.org/T270391) [15:03:41] !log awight@deploy1002 Synchronized php-1.38.0-wmf.24/extensions/VisualEditor/modules/ve-mw/ui/styles/pages/ve.ui.MWParameterPage.css: Backport: [[gerrit:769296|Fix missing padding on inline descriptions (T303386)]] (duration: 00m 49s) [15:03:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:45] T303386: Missing indentation on inline descriptions - https://phabricator.wikimedia.org/T303386 [15:03:58] (03CR) 10Jbond: O:external_clouds_vendors: New module for fetching cloud networks (039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/769410 (https://phabricator.wikimedia.org/T270391) (owner: 10Jbond) [15:05:24] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T298294)', diff saved to https://phabricator.wikimedia.org/P22215 and previous config saved to /var/cache/conftool/dbconfig/20220309-150523-marostegui.json [15:05:26] !log marostegui@cumin2002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2105.codfw.wmnet with reason: Maintenance [15:05:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:28] T298294: Make primary 
key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [15:05:28] !log marostegui@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2105.codfw.wmnet with reason: Maintenance [15:05:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:30] !log marostegui@cumin2002 START - Cookbook sre.hosts.downtime for 16:00:00 on 6 hosts with reason: Maintenance [15:05:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:39] !log marostegui@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on 6 hosts with reason: Maintenance [15:05:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:46] (03PS2) 10Ladsgroup: mediawiki: Update some of education.wikimedia.org redirects [puppet] - 10https://gerrit.wikimedia.org/r/769450 (https://phabricator.wikimedia.org/T303397) [15:05:53] (03CR) 10Vgutierrez: [C: 03+1] R:varnish:instance: Add general public cloud rate limiting [puppet] - 10https://gerrit.wikimedia.org/r/740818 (https://phabricator.wikimedia.org/T224891) (owner: 10Jbond) [15:05:54] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] mediawiki: Update some of education.wikimedia.org redirects [puppet] - 10https://gerrit.wikimedia.org/r/769450 (https://phabricator.wikimedia.org/T303397) (owner: 10Ladsgroup) [15:06:08] !log awight@deploy1002 Synchronized php-1.38.0-wmf.25/extensions/VisualEditor/modules/ve-mw/ui/styles/pages/ve.ui.MWParameterPage.css: Backport: [[gerrit:769297|Fix missing padding on inline descriptions (T303386)]] (duration: 00m 49s) [15:06:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:06:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[15:06:27] taavi: I'm finished with my backports, thank you! [15:06:31] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:06:34] thanks! [15:06:55] !log UTC afternoon deploys done [15:06:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:07:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [15:07:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:49] (03PS1) 10Btullis: Add monitoring for the datahubsearch LVS service [puppet] - 10https://gerrit.wikimedia.org/r/769451 (https://phabricator.wikimedia.org/T301458) [15:07:56] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34168/console" [puppet] - 10https://gerrit.wikimedia.org/r/769410 (https://phabricator.wikimedia.org/T270391) (owner: 10Jbond) [15:07:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:07:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:00] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34167/console" [puppet] - 10https://gerrit.wikimedia.org/r/769388 (https://phabricator.wikimedia.org/T302471) (owner: 10Giuseppe Lavagetto) [15:08:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudvirt1047.eqiad.wmnet - https://phabricator.wikimedia.org/T293391 (10nskaggs) @Cmjohnson What's the status of imaging this box? Why did it fail? 
[15:08:26] (KubernetesCalicoDown) firing: kubernetes2007.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [15:08:51] (03PS1) 10Ladsgroup: mediawiki: Remove most of unused education.wm.o redirects [puppet] - 10https://gerrit.wikimedia.org/r/769452 (https://phabricator.wikimedia.org/T303397) [15:08:53] (03CR) 10Btullis: "I think that this new LVS service is ready to move to the next state." [puppet] - 10https://gerrit.wikimedia.org/r/769451 (https://phabricator.wikimedia.org/T301458) (owner: 10Btullis) [15:10:50] (03CR) 10Vgutierrez: [C: 03+1] varnish/frontend: consume etcd data for dynamic banning of requests. [puppet] - 10https://gerrit.wikimedia.org/r/763557 (https://phabricator.wikimedia.org/T302471) (owner: 10Giuseppe Lavagetto) [15:11:18] (03PS10) 10Jbond: R:varnish:instance: Add general public cloud rate limiting [puppet] - 10https://gerrit.wikimedia.org/r/740818 (https://phabricator.wikimedia.org/T224891) [15:11:57] (03CR) 10Jbond: R:varnish:instance: Add general public cloud rate limiting (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/740818 (https://phabricator.wikimedia.org/T224891) (owner: 10Jbond) [15:13:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:13:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:26] (KubernetesCalicoDown) resolved: kubernetes2007.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [15:13:34] (03PS2) 10Ladsgroup: mediawiki: Remove most of unused education.wm.o redirects [puppet] - 10https://gerrit.wikimedia.org/r/769452 (https://phabricator.wikimedia.org/T303397) [15:13:51] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:13:56] (KubernetesCalicoDown) firing: kubernetes2007.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [15:14:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:14:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [15:14:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:47] (03PS2) 10Ssingh: certspotter: update package and replace cron with systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/768065 (https://phabricator.wikimedia.org/T204993) [15:14:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:14:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:51] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2007.codfw.wmnet with reason: host reimage [15:16:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:01] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34169/console" [puppet] - 10https://gerrit.wikimedia.org/r/768065 (https://phabricator.wikimedia.org/T204993) (owner: 10Ssingh) [15:17:14] (03CR) 10Volans: [C: 03+2] Upstream release v2.3.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/769449 (owner: 10Volans) [15:19:07] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (6) node(s) change every puppet run: cloudcontrol1003, cloudcontrol1004, cloudcontrol1005, datahubsearch1001, datahubsearch1002, datahubsearch1003 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes 
[15:19:29] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2007.codfw.wmnet with reason: host reimage [15:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:33] (03CR) 10Volans: [C: 03+1] "LGTM, nits inline" [puppet] - 10https://gerrit.wikimedia.org/r/769410 (https://phabricator.wikimedia.org/T270391) (owner: 10Jbond) [15:23:22] (03Merged) 10jenkins-bot: Upstream release v2.3.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/769449 (owner: 10Volans) [15:23:34] are folks still doing the backport window? [15:23:41] (KubernetesCalicoDown) resolved: kubernetes2007.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [15:23:56] (KubernetesCalicoDown) firing: kubernetes2007.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [15:25:57] (03CR) 10Vgutierrez: [C: 03+1] certspotter: update package and replace cron with systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/768065 (https://phabricator.wikimedia.org/T204993) (owner: 10Ssingh) [15:26:01] (03PS3) 10C. Scott Ananian: Revert "Enable Parsoid API everywhere" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763779 (https://phabricator.wikimedia.org/T302081) [15:27:21] RoanKattouw: can I squeeze a UBN backport into this window? 
[15:28:24] !log uploaded spicerack_2.3.0 to apt.wikimedia.org buster-wikimedia,bullseye-wikimedia [15:28:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:50] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 135, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:31:49] (03CR) 10Vgutierrez: [C: 03+1] R:varnish:instance: Add general public cloud rate limiting (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/740818 (https://phabricator.wikimedia.org/T224891) (owner: 10Jbond) [15:31:53] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2007.codfw.wmnet with OS bullseye [15:31:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:16] Lucas_WMDE_, urbanecm , RoanKattouw : can i squeeze a UBN into the afternoon backport window? [15:33:27] cscott, the window is technically over. taavi was the one doing it. [15:33:41] (KubernetesCalicoDown) resolved: kubernetes2007.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [15:33:56] cscott: yeah, window's gone, but UBN can be done out of window anyway [15:33:59] !log deploy gerrit:740818 to add more general rate limits for crawling cached and upload pages [15:34:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:02] do you want to self-deploy, or should i deploy for you? [15:34:02] (03CR) 10Jbond: [C: 03+2] R:varnish:instance: Add general public cloud rate limiting [puppet] - 10https://gerrit.wikimedia.org/r/740818 (https://phabricator.wikimedia.org/T224891) (owner: 10Jbond) [15:34:03] thanks, zabe. 
the patches involved weren't marked {{done}} on wiki, so I wasn't sure if it was still in progress [15:34:14] (this is not to say that the patch cannot be deployed now) [15:34:29] urbanecm: i'd rather someone else deploy, my skillz are very rusty (haven't deployed core in ~5 years) [15:34:33] (03Abandoned) 10Vgutierrez: varnish: Rate limit public_cloud_nets on upload [puppet] - 10https://gerrit.wikimedia.org/r/769432 (https://phabricator.wikimedia.org/T282861) (owner: 10Vgutierrez) [15:34:47] cscott: sure. in that case, can you link the patch please? [15:35:06] https://gerrit.wikimedia.org/r/c/768804/ [15:35:15] https://deploy-commands.toolforge.org/bacc/768804 [15:35:29] (03CR) 10Urbanecm: [C: 03+2] Ensure that the recognizedTagData static cache is properly initialized [core] (wmf/1.38.0-wmf.25) - 10https://gerrit.wikimedia.org/r/768804 (https://phabricator.wikimedia.org/T303360) (owner: 10C. Scott Ananian) [15:36:02] cscott: thanks, +2'ed. Will ping you once it's tested [15:37:33] (cscott: in case you want to practice the rusty skills a bit using the deploy-commands tool, I'm also happy to be your guide, up2you) [15:42:30] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:43:22] PROBLEM - SSH on wtp1026.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:46:26] urbanecm: i've got 2 other patches to finish before 1.38 branches next week, so i'm perfectly happy not to be dusting off rusty skills this week. i'll practice my deploy-fu some other time. :) [15:49:43] cscott: sounds good :) [15:49:50] (03Merged) 10jenkins-bot: Ensure that the recognizedTagData static cache is properly initialized [core] (wmf/1.38.0-wmf.25) - 10https://gerrit.wikimedia.org/r/768804 (https://phabricator.wikimedia.org/T303360) (owner: 10C. 
Scott Ananian) [15:50:26] cscott: pulled to mwdebug1001. Can you test the fix there please? [15:56:06] urbanecm: seems to work, at least in my banging on it [15:56:08] !log btullis@cumin1001 START - Cookbook sre.ganeti.makevm for new host karapace1001.eqiad.wmnet [15:56:10] !log btullis@cumin1001 START - Cookbook sre.dns.netbox [15:56:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:13] 10SRE, 10vm-requests: eqiad: 1 VM requested for karapace - https://phabricator.wikimedia.org/T301563 (10BTullis) Proceeding with this now. ` btullis@cumin1001:~$ sudo cookbook sre.ganeti.makevm --vcpus 2 --memory 4 --disk 20 eqiad_A karapace1001 Ready to create Ganeti VM karapace1001.eqiad.wmnet in the ganeti0... [15:56:13] okay, let's sync it then [15:56:45] urbanecm: the ticket didn't have a great repo recipe, but it was causing logs 7x/hr, so we'll know for sure before wmf.25 is rolled further at least. [15:57:08] makes sense [15:57:28] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.25/includes/parser/Sanitizer.php: 31189c6aa4dc880a9eebe6824dbc031e9109384f: Ensure that the recognizedTagData static cache is properly initialized (T303360) (duration: 00m 51s) [15:57:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:31] T303360: TypeError: Argument 3 passed to MediaWiki\Parser\RemexRemoveTagHandler::__construct() must be of the type array, null given, called in /srv/mediawiki/php-1.38.0-wmf.25/includes/parser/Sanitizer.php on line 367 - https://phabricator.wikimedia.org/T303360 [15:57:35] cscott: and should be live [15:57:41] anything else? 
[15:58:29] i've got a config change in the queue, but that can wait until the next backport window, it's not urgent [15:58:45] https://gerrit.wikimedia.org/r/c/763779/ [15:58:57] https://deploy-commands.toolforge.org/bacc/763779 [15:59:13] since it's not urgent, I'd say yeah, let's wait for the next B&C with it :) [15:59:23] works for me [15:59:25] thanks for your help! [15:59:30] np! [16:00:28] !log btullis@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:00:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:13] 10SRE, 10Znuny, 10serviceops: enhance Znuny (otrs) alerting - https://phabricator.wikimedia.org/T303190 (10Arnoldokoth) [16:04:24] (03PS1) 10Vgutierrez: cache::haproxy: Ensure that old HAProxy instances die after 5m [puppet] - 10https://gerrit.wikimedia.org/r/769462 (https://phabricator.wikimedia.org/T290005) [16:04:58] (03PS14) 10Jbond: C:varnish: Add the external_cloud_vendors module to the cache clusters [puppet] - 10https://gerrit.wikimedia.org/r/769132 (https://phabricator.wikimedia.org/T270391) [16:08:55] (03CR) 10Vgutierrez: [C: 03+2] cache::haproxy: Ensure that old HAProxy instances die after 5m [puppet] - 10https://gerrit.wikimedia.org/r/769462 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [16:10:08] !log btullis@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host karapace1001.eqiad.wmnet [16:10:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:21] (03PS9) 10Jbond: O:external_clouds_vendors: New module for fetching cloud networks [puppet] - 10https://gerrit.wikimedia.org/r/769410 (https://phabricator.wikimedia.org/T270391) [16:11:28] (03CR) 10Jbond: O:external_clouds_vendors: New module for fetching cloud networks (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/769410 (https://phabricator.wikimedia.org/T270391) (owner: 10Jbond) [16:13:57] (03PS1) 10Elukey: Add bullseye + overlayfs settings to kubernetes2008 
[puppet] - https://gerrit.wikimedia.org/r/769463 (https://phabricator.wikimedia.org/T300744)
[16:15:16] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34170/console" [puppet] - https://gerrit.wikimedia.org/r/769463 (https://phabricator.wikimedia.org/T300744) (owner: Elukey)
[16:15:35] (CR) Elukey: Add bullseye + overlayfs settings to kubernetes2008 [puppet] - https://gerrit.wikimedia.org/r/769463 (https://phabricator.wikimedia.org/T300744) (owner: Elukey)
[16:15:58] (PS15) Jbond: C:varnish: Add the external_cloud_vendors module to the cache clusters [puppet] - https://gerrit.wikimedia.org/r/769132 (https://phabricator.wikimedia.org/T270391)
[16:16:00] (PS1) Jbond: C:varnish: Load public-clouds.json via netmapper [puppet] - https://gerrit.wikimedia.org/r/769464 (https://phabricator.wikimedia.org/T270391)
[16:16:14] (Abandoned) Jbond: O:external_clouds_vendors: New module for fetching cloud networks [puppet] - https://gerrit.wikimedia.org/r/769447 (https://phabricator.wikimedia.org/T270391) (owner: Jbond)
[16:16:38] (CR) Cwhite: grafana ldap users sync: enable retries (1 comment) [puppet] - https://gerrit.wikimedia.org/r/769142 (https://phabricator.wikimedia.org/T303064) (owner: Cwhite)
[16:17:25] (PS10) Jbond: O:external_clouds_vendors: New module for fetching cloud networks [puppet] - https://gerrit.wikimedia.org/r/769410 (https://phabricator.wikimedia.org/T270391)
[16:18:28] (CR) jerkins-bot: [V: -1] O:external_clouds_vendors: New module for fetching cloud networks [puppet] - https://gerrit.wikimedia.org/r/769410 (https://phabricator.wikimedia.org/T270391) (owner: Jbond)
[16:20:30] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets -
https://alerts.wikimedia.org
[16:22:26] !log installing 5.10.103 kernels on bullseye hosts
[16:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:25:30] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[16:33:58] SRE, vm-requests: eqiad: 1 VM requested for karapace - https://phabricator.wikimedia.org/T301563 (BTullis) Open→Resolved Completed successfully. ` Created interface ##PRIMARY## on VM karapace1001 Attached IPv4 10.64.0.24/22 and IPv6 2620:0:861:101:10:64:0:24/64 to VM karapace1001 and marked as pr...
[16:35:30] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[16:36:14] (PS11) Jbond: O:external_clouds_vendors: New module for fetching cloud networks [puppet] - https://gerrit.wikimedia.org/r/769410 (https://phabricator.wikimedia.org/T270391)
[16:36:16] (PS16) Jbond: C:varnish: Add the external_cloud_vendors module to the cache clusters [puppet] - https://gerrit.wikimedia.org/r/769132 (https://phabricator.wikimedia.org/T270391)
[16:36:18] (PS2) Jbond: C:varnish: Load public-clouds.json via netmapper [puppet] - https://gerrit.wikimedia.org/r/769464 (https://phabricator.wikimedia.org/T270391)
[16:36:20] (PS1) Jbond: C:varnish: update templates netmapper public clouds [puppet] - https://gerrit.wikimedia.org/r/769466
[16:38:34] (PS17) Jbond: C:varnish: Add the external_cloud_vendors module to the cache clusters [puppet] - https://gerrit.wikimedia.org/r/769132 (https://phabricator.wikimedia.org/T270391)
[16:38:45] (PS3) Jbond: C:varnish: Load public-clouds.json via netmapper [puppet] -
https://gerrit.wikimedia.org/r/769464 (https://phabricator.wikimedia.org/T270391)
[16:39:01] (CR) JMeybohm: [C: +1] Add bullseye + overlayfs settings to kubernetes2008 [puppet] - https://gerrit.wikimedia.org/r/769463 (https://phabricator.wikimedia.org/T300744) (owner: Elukey)
[16:39:45] (PS4) Jbond: C:varnish: Load public-clouds.json via netmapper [puppet] - https://gerrit.wikimedia.org/r/769464 (https://phabricator.wikimedia.org/T270391)
[16:40:07] (PS2) Jbond: C:varnish: update templates netmapper public clouds [puppet] - https://gerrit.wikimedia.org/r/769466
[16:41:06] (PS3) Jbond: C:varnish: update templates netmapper public clouds [puppet] - https://gerrit.wikimedia.org/r/769466
[16:42:01] (CR) Elukey: [C: +2] Add bullseye + overlayfs settings to kubernetes2008 [puppet] - https://gerrit.wikimedia.org/r/769463 (https://phabricator.wikimedia.org/T300744) (owner: Elukey)
[16:43:50] SRE, Gerrit, serviceops, Release-Engineering-Team (Seen): Deploy multi-site plugin to gerrit1001 and gerrit2001 - https://phabricator.wikimedia.org/T217174 (hashar) Open→Declined It is really unlikely we will setup the multsite plugin though if we change our mind later we can always reope...
[16:44:21] RECOVERY - SSH on wtp1026.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:45:30] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[16:45:54] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2008.codfw.wmnet with OS bullseye
[16:45:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:46:29] (PS3) JHathaway: profile::mirrors: move mirrors module into profiles [puppet] - https://gerrit.wikimedia.org/r/767889 (https://phabricator.wikimedia.org/T300985)
[16:46:42] (CR) JHathaway: profile::mirrors: move mirrors module into profiles (2 comments) [puppet] - https://gerrit.wikimedia.org/r/767889 (https://phabricator.wikimedia.org/T300985) (owner: JHathaway)
[16:47:27] (PS1) Btullis: Add boot configuration for karapace1001 [puppet] - https://gerrit.wikimedia.org/r/769468 (https://phabricator.wikimedia.org/T301562)
[16:47:54] (PS1) Jbond: varnish: create rate limit keyed on the cloud provider [puppet] - https://gerrit.wikimedia.org/r/769469 (https://phabricator.wikimedia.org/T270391)
[16:48:06] (PS2) Btullis: Add boot configuration for karapace1001 [puppet] - https://gerrit.wikimedia.org/r/769468 (https://phabricator.wikimedia.org/T301562)
[16:48:13] (CR) Jbond: [C: +2] O:external_clouds_vendors: New module for fetching cloud networks [puppet] - https://gerrit.wikimedia.org/r/769410 (https://phabricator.wikimedia.org/T270391) (owner: Jbond)
[16:48:36] (CR) Jbond: [C: +2] O:external_clouds_vendors: New module for fetching cloud networks (1 comment) [puppet] - https://gerrit.wikimedia.org/r/769410 (https://phabricator.wikimedia.org/T270391) (owner: Jbond)
[16:48:50]
(CR) Btullis: "This will use bullseye by default as it is a new service and has no known restrictions." [puppet] - https://gerrit.wikimedia.org/r/769468 (https://phabricator.wikimedia.org/T301562) (owner: Btullis)
[16:49:21] !log reboot rdb2008 for upgrades
[16:49:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:49:51] (PS18) Jbond: C:varnish: Add the external_cloud_vendors module to the cache clusters [puppet] - https://gerrit.wikimedia.org/r/769132 (https://phabricator.wikimedia.org/T270391)
[16:49:59] (PS5) Jbond: C:varnish: Load public-clouds.json via netmapper [puppet] - https://gerrit.wikimedia.org/r/769464 (https://phabricator.wikimedia.org/T270391)
[16:50:05] (PS4) Jbond: C:varnish: update templates netmapper public clouds [puppet] - https://gerrit.wikimedia.org/r/769466
[16:50:30] (PS2) Jbond: C:varnish: create rate limit keyed on the cloud provider [puppet] - https://gerrit.wikimedia.org/r/769469 (https://phabricator.wikimedia.org/T270391)
[16:50:30] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[16:50:38] (PS3) Jbond: C:varnish: create rate limit keyed on the cloud provider [puppet] - https://gerrit.wikimedia.org/r/769469 (https://phabricator.wikimedia.org/T270391)
[16:53:26] (KubernetesCalicoDown) firing: kubernetes2008.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org
[16:55:11] (CR) Muehlenhoff: [C: +1] "LGTM, one nit inline."
[puppet] - https://gerrit.wikimedia.org/r/769468 (https://phabricator.wikimedia.org/T301562) (owner: Btullis)
[16:55:14] (PS1) Jbond: O:external_clouds_vendors: add dependency on systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/769471
[16:55:26] (PS22) Btullis: Add helm charts and a helmfile configuration for datahub [deployment-charts] - https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454)
[16:55:30] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[16:56:28] (CR) Jbond: [C: +2] O:external_clouds_vendors: add dependency on systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/769471 (owner: Jbond)
[16:56:31] !log reboot rdb[2008,2010].codfw.wmnet,rdb[1010,1012].eqiad.wmnet for upgrades
[16:56:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:56:51] SRE, Data-Catalog, Data-Engineering, serviceops, Service-deployment-requests: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (JMeybohm) >>! In T303049#7753274, @BTullis wrote: > How can I tell what the source IP address(es) of my services will be, as seen by the back...
[16:57:33] PROBLEM - Host rdb2008 is DOWN: PING CRITICAL - Packet loss = 100%
[16:58:19] PROBLEM - Host rdb1012 is DOWN: PING CRITICAL - Packet loss = 100%
[16:58:27] PROBLEM - Host rdb2010 is DOWN: PING CRITICAL - Packet loss = 100%
[16:58:32] (PS1) RLazarus: No-op change to test helm-lint, do not merge [deployment-charts] - https://gerrit.wikimedia.org/r/769472
[16:58:48] (PS1) Volans: prospector: ignore deprecation message [software/pywmflib] - https://gerrit.wikimedia.org/r/769473
[16:58:50] (PS1) Volans: requests: allow to customize methods and codes [software/pywmflib] - https://gerrit.wikimedia.org/r/769474
[16:58:52] (CR) RLazarus: [C: -2] No-op change to test helm-lint, do not merge [deployment-charts] - https://gerrit.wikimedia.org/r/769472 (owner: RLazarus)
[16:59:01] RECOVERY - Host rdb1012 is UP: PING OK - Packet loss = 0%, RTA = 2.75 ms
[16:59:07] RECOVERY - Host rdb2008 is UP: PING OK - Packet loss = 0%, RTA = 31.61 ms
[16:59:07] RECOVERY - Host rdb2010 is UP: PING OK - Packet loss = 0%, RTA = 33.14 ms
[16:59:15] (CR) jerkins-bot: [V: -1] Add helm charts and a helmfile configuration for datahub [deployment-charts] - https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: Btullis)
[16:59:27] akosiaris: didn't use the cookbook eh? :-P
[17:00:12] volans: I didn't indeed.
I should have I guess
[17:00:30] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[17:01:18] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2008.codfw.wmnet with reason: host reimage
[17:01:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:02:32] (CR) jerkins-bot: [V: -1] No-op change to test helm-lint, do not merge [deployment-charts] - https://gerrit.wikimedia.org/r/769472 (owner: RLazarus)
[17:02:50] (CR) Alexandros Kosiaris: [C: +2] gerrit: prevent 'null' entry in email [puppet] - https://gerrit.wikimedia.org/r/768005 (https://phabricator.wikimedia.org/T288312) (owner: Hashar)
[17:03:01] btullis: snooping a little, your jerkins error looks like it's due to my envoy upgrade in helm-lint, sorry about that -- let me work on getting it unstuck
[17:03:32] rzl: Many thanks.
[17:03:53] Ha. I like jerkins :-)
[17:04:42] (CR) JHathaway: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34171/console" [puppet] - https://gerrit.wikimedia.org/r/767889 (https://phabricator.wikimedia.org/T300985) (owner: JHathaway)
[17:04:45] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2008.codfw.wmnet with reason: host reimage
[17:04:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:07:22] SRE, ops-eqiad, Cloud-VPS, DC-Ops, and 2 others: cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet and cloudvirt1021.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (Cmjohnson) I am able to update the BIOS but these servers were not initially purchased with the 10G...
[17:07:33] PROBLEM - SSH on analytics1067.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:08:26] (KubernetesCalicoDown) resolved: kubernetes2008.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org
[17:10:15] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:10:30] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[17:12:51] (PS1) Reedy: Mark removals of WebAuthn as done by self [extensions/WebAuthn] (wmf/1.38.0-wmf.25) - https://gerrit.wikimedia.org/r/769298 (https://phabricator.wikimedia.org/T303404)
[17:13:01] (PS1) Reedy: Mark removals of WebAuthn as done by self [extensions/WebAuthn] (wmf/1.38.0-wmf.24) - https://gerrit.wikimedia.org/r/769299 (https://phabricator.wikimedia.org/T303404)
[17:13:41] (KubernetesCalicoDown) firing: kubernetes2008.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org
[17:16:49] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 135, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:17:31] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2008.codfw.wmnet with OS bullseye
[17:17:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:18:41] (KubernetesCalicoDown) resolved: kubernetes2008.codfw.wmnet:9091 is not running calico-node Pod -
https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org
[17:19:55] SRE, ops-eqiad, DC-Ops, serviceops: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (Cmjohnson) @Dzahn I do not have enough space in row C to put more than 3 servers and all 3 of those are in one rack (C5). Can I put 3 in row C and then 3 in row...
[17:20:09] (CR) Reedy: [C: +2] Mark removals of WebAuthn as done by self [extensions/WebAuthn] (wmf/1.38.0-wmf.24) - https://gerrit.wikimedia.org/r/769299 (https://phabricator.wikimedia.org/T303404) (owner: Reedy)
[17:20:13] (CR) Reedy: [C: +2] Mark removals of WebAuthn as done by self [extensions/WebAuthn] (wmf/1.38.0-wmf.25) - https://gerrit.wikimedia.org/r/769298 (https://phabricator.wikimedia.org/T303404) (owner: Reedy)
[17:20:30] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[17:24:35] (Merged) jenkins-bot: Mark removals of WebAuthn as done by self [extensions/WebAuthn] (wmf/1.38.0-wmf.24) - https://gerrit.wikimedia.org/r/769299 (https://phabricator.wikimedia.org/T303404) (owner: Reedy)
[17:24:37] (Merged) jenkins-bot: Mark removals of WebAuthn as done by self [extensions/WebAuthn] (wmf/1.38.0-wmf.25) - https://gerrit.wikimedia.org/r/769298 (https://phabricator.wikimedia.org/T303404) (owner: Reedy)
[17:24:53] jouncebot: nowandnext
[17:24:53] No deployments scheduled for the next 1 hour(s) and 35 minute(s)
[17:24:53] In 1 hour(s) and 35 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220309T1900)
[17:24:53] In 1 hour(s) and 35 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220309T1900)
[17:26:30] (PS2)
JHathaway: profile::mirrros: switch to apache2 [puppet] - https://gerrit.wikimedia.org/r/767903 (https://phabricator.wikimedia.org/T300985)
[17:27:29] (CR) JHathaway: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34172/console" [puppet] - https://gerrit.wikimedia.org/r/767903 (https://phabricator.wikimedia.org/T300985) (owner: JHathaway)
[17:28:39] !log reedy@deploy1002 Synchronized php-1.38.0-wmf.24/extensions/WebAuthn/: T303404 (duration: 00m 51s)
[17:28:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:28:43] T303404: ArgumentCountError: Too few arguments to function MediaWiki\Extension\OATHAuth\OATHUserRepository::remove(), 2 passed in /srv/mediawiki/php-1.38.0-wmf.25/extensions/WebAuthn/src/HTMLForm/WebAuthnDisableForm.php on line 114 and exactly 3 expected - https://phabricator.wikimedia.org/T303404
[17:29:08] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[17:29:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:29:33] (PS15) Giuseppe Lavagetto: varnish/frontend: consume etcd data for dynamic banning of requests.
[puppet] - https://gerrit.wikimedia.org/r/763557 (https://phabricator.wikimedia.org/T302471)
[17:29:35] (PS5) Giuseppe Lavagetto: varnish: enable dynamic bans on one host per cluster in eqsin [puppet] - https://gerrit.wikimedia.org/r/769388 (https://phabricator.wikimedia.org/T302471)
[17:29:37] !log reedy@deploy1002 Synchronized php-1.38.0-wmf.25/extensions/WebAuthn/: T303404 (duration: 00m 53s)
[17:29:37] (PS5) Giuseppe Lavagetto: cache: turn on dynamic bans on all of eqsin [puppet] - https://gerrit.wikimedia.org/r/769389 (https://phabricator.wikimedia.org/T302471)
[17:29:39] (PS5) Giuseppe Lavagetto: cache: enable dynamic bans everywhere [puppet] - https://gerrit.wikimedia.org/r/769390 (https://phabricator.wikimedia.org/T302471)
[17:29:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:30:52] (PS3) Btullis: Add boot configuration for karapace1001 [puppet] - https://gerrit.wikimedia.org/r/769468 (https://phabricator.wikimedia.org/T301562)
[17:31:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[17:31:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:31:16] (CR) Btullis: Add boot configuration for karapace1001 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/769468 (https://phabricator.wikimedia.org/T301562) (owner: Btullis)
[17:31:57] (PS4) Btullis: Add boot configuration for karapace1001 [puppet] - https://gerrit.wikimedia.org/r/769468 (https://phabricator.wikimedia.org/T301562)
[17:31:59] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1047.eqiad.wmnet with OS bullseye
[17:32:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:32:02] (PS16) Giuseppe Lavagetto: varnish/frontend: consume etcd data for dynamic banning of requests.
[puppet] - https://gerrit.wikimedia.org/r/763557 (https://phabricator.wikimedia.org/T302471)
[17:32:04] (PS6) Giuseppe Lavagetto: varnish: enable dynamic bans on one host per cluster in eqsin [puppet] - https://gerrit.wikimedia.org/r/769388 (https://phabricator.wikimedia.org/T302471)
[17:32:05] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudvirt1047.eqiad.wmnet - https://phabricator.wikimedia.org/T293391 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1047.eqiad.wmnet with OS b...
[17:32:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[17:32:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[17:32:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:32:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:33:01] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34174/console" [puppet] - https://gerrit.wikimedia.org/r/769388 (https://phabricator.wikimedia.org/T302471) (owner: Giuseppe Lavagetto)
[17:33:19] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:33:19] (PS1) Cathal Mooney: Initial changes to Homer config and templates for EVPN switches Eqiad [homer/public] - https://gerrit.wikimedia.org/r/769478 (https://phabricator.wikimedia.org/T299758)
[17:33:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:33:53] (CR) jerkins-bot: [V: -1] Initial changes to Homer config and templates for EVPN switches Eqiad [homer/public] - https://gerrit.wikimedia.org/r/769478 (https://phabricator.wikimedia.org/T299758) (owner: Cathal Mooney)
[17:34:04] SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q3:(Need By: TBD) rack/setup/install
ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (Cmjohnson)
[17:36:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[17:36:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:36:26] (PS17) Giuseppe Lavagetto: varnish/frontend: consume etcd data for dynamic banning of requests. [puppet] - https://gerrit.wikimedia.org/r/763557 (https://phabricator.wikimedia.org/T302471)
[17:36:30] (PS7) Giuseppe Lavagetto: varnish: enable dynamic bans on one host per cluster in eqsin [puppet] - https://gerrit.wikimedia.org/r/769388 (https://phabricator.wikimedia.org/T302471)
[17:36:31] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T300775)', diff saved to https://phabricator.wikimedia.org/P22217 and previous config saved to /var/cache/conftool/dbconfig/20220309-173630-marostegui.json
[17:36:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:36:34] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775
[17:36:57] (CR) jerkins-bot: [V: -1] varnish/frontend: consume etcd data for dynamic banning of requests. [puppet] - https://gerrit.wikimedia.org/r/763557 (https://phabricator.wikimedia.org/T302471) (owner: Giuseppe Lavagetto)
[17:38:33] (PS18) Giuseppe Lavagetto: varnish/frontend: consume etcd data for dynamic banning of requests.
[puppet] - https://gerrit.wikimedia.org/r/763557 (https://phabricator.wikimedia.org/T302471)
[17:38:35] (PS8) Giuseppe Lavagetto: varnish: enable dynamic bans on one host per cluster in eqsin [puppet] - https://gerrit.wikimedia.org/r/769388 (https://phabricator.wikimedia.org/T302471)
[17:40:07] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34176/console" [puppet] - https://gerrit.wikimedia.org/r/769388 (https://phabricator.wikimedia.org/T302471) (owner: Giuseppe Lavagetto)
[17:41:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[17:41:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:41:19] !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1047.eqiad.wmnet with OS bullseye
[17:41:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:41:25] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudvirt1047.eqiad.wmnet - https://phabricator.wikimedia.org/T293391 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1047.eqiad.wmnet with OS bulls...
[17:42:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[17:42:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[17:42:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:42:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:42:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[17:42:53] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudvirt1047.eqiad.wmnet - https://phabricator.wikimedia.org/T293391 (Andrew) I just now re-ran the imaging script and it looks like a failure to pxe boot
[17:42:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:43:32] (CR) CDanis: Enable profile::auto_restarts::service for klaxon gunicorn webapp (1 comment) [puppet] - https://gerrit.wikimedia.org/r/767516 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[17:43:41] SRE, ops-eqiad, Cloud-VPS, DC-Ops, and 2 others: cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet and cloudvirt1021.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (Andrew) Similar problems with cloudvirt1047: T293391
[17:46:06] (PS2) Cathal Mooney: Initial changes to Homer config and templates for EVPN switches Eqiad [homer/public] - https://gerrit.wikimedia.org/r/769478 (https://phabricator.wikimedia.org/T299758)
[17:47:36] (PS4) Jcrespo: Check that xtrabackup --prepare is using the same version [software/wmfbackups] - https://gerrit.wikimedia.org/r/769428 (https://phabricator.wikimedia.org/T253959)
[17:48:04] (CR) jerkins-bot: [V: -1] Check that xtrabackup --prepare is using the same version [software/wmfbackups] - https://gerrit.wikimedia.org/r/769428 (https://phabricator.wikimedia.org/T253959) (owner: Jcrespo)
[17:51:10] (CR) CDanis: "some quick
followups" [puppet] - https://gerrit.wikimedia.org/r/769410 (https://phabricator.wikimedia.org/T270391) (owner: Jbond)
[17:51:41] (PS5) Jcrespo: Check that xtrabackup --prepare is using the same version [software/wmfbackups] - https://gerrit.wikimedia.org/r/769428 (https://phabricator.wikimedia.org/T253959)
[17:52:07] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P22219 and previous config saved to /var/cache/conftool/dbconfig/20220309-175205-marostegui.json
[17:52:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:58:34] (CR) RLazarus: [C: -2] "recheck" [deployment-charts] - https://gerrit.wikimedia.org/r/769472 (owner: RLazarus)
[18:02:12] SRE, VPS-project-Codesearch: Add operations/software/purged to Codesearch - https://phabricator.wikimedia.org/T303434 (Krinkle)
[18:02:57] (PS23) Btullis: Add helm charts and a helmfile configuration for datahub [deployment-charts] - https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454)
[18:03:00] (CR) RLazarus: "recheck" [deployment-charts] - https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: Btullis)
[18:03:43] rzl: snap --^
[18:04:18] haha outraced :D sorry for the inconvenience, you should be back on track
[18:04:30] if that fails again I'll double-check that it's not my fault in some new innovative way
[18:05:26] (CR) Btullis: [C: +2] Add boot configuration for karapace1001 [puppet] - https://gerrit.wikimedia.org/r/769468 (https://phabricator.wikimedia.org/T301562) (owner: Btullis)
[18:05:59] (CR) Btullis: [C: +2] Add boot configuration for karapace1001 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/769468 (https://phabricator.wikimedia.org/T301562) (owner: Btullis)
[18:06:01] (PS1) Phuedx: beta: Include mediawiki.ipinfo_interaction in $wgEventLoggingStreamNames
[mediawiki-config] - https://gerrit.wikimedia.org/r/769484
[18:07:42] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P22220 and previous config saved to /var/cache/conftool/dbconfig/20220309-180741-marostegui.json
[18:07:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:13:05] SRE, Data-Catalog, Data-Engineering, serviceops, Service-deployment-requests: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (BTullis) > Those will be the IP ranges of the different k8s clusters (the non-ML ones). You can look those up in netbox: https://netbox.wikim...
[18:14:20] rzl: worked like a charm. Many thanks.
[18:14:47] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:15:08] SRE, Traffic, envoy, serviceops, Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (RLazarus)
[18:15:12] btullis: \i/
[18:15:14] er, \o/
[18:23:17] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T300775)', diff saved to https://phabricator.wikimedia.org/P22221 and previous config saved to /var/cache/conftool/dbconfig/20220309-182316-marostegui.json
[18:23:19] !log marostegui@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[18:23:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:23:21] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775
[18:23:21] !log marostegui@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[18:23:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:23:24]
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:23:56] !log marostegui@cumin2002 dbctl commit (dc=all): 'Depooling db1127 (T300775)', diff saved to https://phabricator.wikimedia.org/P22222 and previous config saved to /var/cache/conftool/dbconfig/20220309-182355-marostegui.json
[18:23:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:29:05] (CR) JMeybohm: [C: -1] "Did not manage to read everything yet, submitting the first set of comments anyways." [deployment-charts] - https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: Btullis)
[18:41:21] jouncebot: nowandnext
[18:41:21] No deployments scheduled for the next 0 hour(s) and 18 minute(s)
[18:41:22] In 0 hour(s) and 18 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220309T1900)
[18:41:22] In 0 hour(s) and 18 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220309T1900)
[18:42:08] borrowing mwdebug1001 for an envoy update, will be done or rolled back before the train window
[18:42:21] 👍🏾
[18:46:42] (CR) Volans: "question inline" [puppet] - https://gerrit.wikimedia.org/r/769466 (owner: Jbond)
[18:48:11] so far so good -- updating on mw1414 and restbase1016 too, then I'll take my hands off and let that bake for a while
[18:52:11] (CR) Jbond: [C: +1] prospector: ignore deprecation message [software/pywmflib] - https://gerrit.wikimedia.org/r/769473 (owner: Volans)
[18:53:00] (CR) Jbond: [C: +1] requests: allow to customize methods and codes [software/pywmflib] - https://gerrit.wikimedia.org/r/769474 (owner: Volans)
[18:54:20] (CR) Volans: [C: +2] prospector: ignore deprecation message [software/pywmflib] - https://gerrit.wikimedia.org/r/769473 (owner: Volans)
[18:54:29] (CR) Volans: [C: +2] requests: allow to customize methods and codes
[software/pywmflib] - 10https://gerrit.wikimedia.org/r/769474 (owner: 10Volans) [18:56:55] (03Merged) 10jenkins-bot: prospector: ignore deprecation message [software/pywmflib] - 10https://gerrit.wikimedia.org/r/769473 (owner: 10Volans) [18:57:00] (03Merged) 10jenkins-bot: requests: allow to customize methods and codes [software/pywmflib] - 10https://gerrit.wikimedia.org/r/769474 (owner: 10Volans) [18:57:14] looking good, although it seems like there was more latency impact from the update itself than I expected -- I'll dig into logs for why that is, before rolling this out everywhere [18:57:23] done for now though, have a good train! [18:58:20] thx. [19:00:04] dancy and brennen: That opportune time is upon us again. Time for a Train log triage with CPT deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220309T1900). [19:00:04] dancy and brennen: I, the Bot under the Fountain, call upon thee, The Deployer, to do MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220309T1900). [19:00:10] o/ [19:01:33] (ah, i see we're still blocked.) [19:02:43] I _think_ https://phabricator.wikimedia.org/T303360 might be handled already. I pinged cscott on the ticket. No answer yet. [19:03:24] yeah, looks like the backport was deployed earlier, and i don't see that error in the last 2 hrs. [19:03:55] ok. 
Moving forward [19:04:19] (03PS1) 10Ahmon Dancy: group1 wikis to 1.38.0-wmf.25 refs T300201 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/769490 [19:04:21] (03CR) 10Ahmon Dancy: [C: 03+2] group1 wikis to 1.38.0-wmf.25 refs T300201 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/769490 (owner: 10Ahmon Dancy) [19:04:59] (03Merged) 10jenkins-bot: group1 wikis to 1.38.0-wmf.25 refs T300201 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/769490 (owner: 10Ahmon Dancy) [19:06:15] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.38.0-wmf.25 refs T300201 [19:06:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:21] T300201: 1.38.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T300201 [19:07:05] !log dancy@deploy1002 Synchronized php: group1 wikis to 1.38.0-wmf.25 refs T300201 (duration: 00m 49s) [19:07:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [19:08:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:16] (03Abandoned) 10RLazarus: No-op change to test helm-lint, do not merge [deployment-charts] - 10https://gerrit.wikimedia.org/r/769472 (owner: 10RLazarus) [19:09:31] (03PS19) 10Jbond: C:varnish: Add the external_cloud_vendors module to the cache clusters [puppet] - 10https://gerrit.wikimedia.org/r/769132 (https://phabricator.wikimedia.org/T270391) [19:09:33] (03PS6) 10Jbond: C:varnish: Load public-clouds.json via netmapper [puppet] - 10https://gerrit.wikimedia.org/r/769464 (https://phabricator.wikimedia.org/T270391) [19:09:35] (03PS5) 10Jbond: C:varnish: update templates netmapper public clouds [puppet] - 10https://gerrit.wikimedia.org/r/769466 [19:09:37] (03PS4) 10Jbond: C:varnish: create rate limit keyed on the cloud provider [puppet] - 10https://gerrit.wikimedia.org/r/769469 
(https://phabricator.wikimedia.org/T270391) [19:09:39] (03PS1) 10Jbond: C:external_clouds_vendors: follow up fixes [puppet] - 10https://gerrit.wikimedia.org/r/769492 [19:09:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [19:10:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [19:10:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [19:10:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:29] (03CR) 10Jbond: "thanks" [puppet] - 10https://gerrit.wikimedia.org/r/769410 (https://phabricator.wikimedia.org/T270391) (owner: 10Jbond) [19:16:55] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [19:17:04] Rolling back to group0. 
[19:17:28] (03CR) 10Jbond: C:varnish: update templates netmapper public clouds (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/769466 (owner: 10Jbond) [19:17:32] (03PS1) 10Ahmon Dancy: group1 wikis to 1.38.0-wmf.24 refs T300201 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/769495 [19:17:34] (03CR) 10Ahmon Dancy: [C: 03+2] group1 wikis to 1.38.0-wmf.24 refs T300201 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/769495 (owner: 10Ahmon Dancy) [19:17:38] ack [19:18:59] RECOVERY - Check systemd state on datahubsearch1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:19:00] (03Merged) 10jenkins-bot: group1 wikis to 1.38.0-wmf.24 refs T300201 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/769495 (owner: 10Ahmon Dancy) [19:20:16] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.38.0-wmf.24 refs T300201 [19:20:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:20] T300201: 1.38.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T300201 [19:20:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [19:21:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:06] !log dancy@deploy1002 Synchronized php: group1 wikis to 1.38.0-wmf.24 refs T300201 (duration: 00m 50s) [19:21:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [19:22:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [19:22:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply 
[19:23:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:57] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:24:53] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:26:21] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:27:13] PROBLEM - Check systemd state on datahubsearch1003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-elasticsearch-exporter-9200.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:28:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [19:28:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [19:29:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [19:29:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [19:30:17] PROBLEM - SSH on aqs1009.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:30:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:09] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:36:52] 10SRE, 10SRE-Access-Requests: Request Administrator Access to Google Search Console - https://phabricator.wikimedia.org/T302625 (10dr0ptp4kt) Hi all. Following up from email thread. The request here is for delegated owner access to @SCherukuwada's Google Workspace account via the following: https://www.googl... [19:41:12] (03PS1) 10Zabe: Bump the cache version of Gadget [extensions/Gadgets] (wmf/1.38.0-wmf.25) - 10https://gerrit.wikimedia.org/r/769304 (https://phabricator.wikimedia.org/T303391) [19:41:33] (03PS2) 10Zabe: Bump the cache version of Gadget [extensions/Gadgets] (wmf/1.38.0-wmf.25) - 10https://gerrit.wikimedia.org/r/769304 (https://phabricator.wikimedia.org/T303391) [19:43:13] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host cloudvirt1047.mgmt.eqiad.wmnet with reboot policy FORCED [19:43:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:37] (03PS1) 10Volans: CHANGELOG: add changelogs for release v1.1.0 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/769506 [19:44:35] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 43, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:44:42] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v1.1.0 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/769506 (owner: 10Volans) [19:45:04] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirt1047.mgmt.eqiad.wmnet with reboot policy FORCED [19:45:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:05] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:46:27] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:47:17] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [19:47:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:25] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:47:40] (03PS1) 10Majavah: policies/cr-labs: Allow tftp to install servers [homer/public] - 10https://gerrit.wikimedia.org/r/769508 (https://phabricator.wikimedia.org/T303296) [19:47:45] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v1.1.0 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/769506 (owner: 10Volans) [19:48:53] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:49:05] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/767889 (https://phabricator.wikimedia.org/T300985) (owner: 10JHathaway) [19:49:13] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:51:34] (03PS1) 10Volans: Upstream release v1.1.0 [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/769509 [19:51:57] (03CR) 10Volans: [C: 03+2] Upstream release v1.1.0 [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/769509 (owner: 10Volans) [19:53:51] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 
https://gerrit.wikimedia.org/r/767903 (https://phabricator.wikimedia.org/T300985) (owner: JHathaway) [19:54:23] (Merged) jenkins-bot: Upstream release v1.1.0 [software/pywmflib] (debian) - https://gerrit.wikimedia.org/r/769509 (owner: Volans) [19:54:47] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1047.eqiad.wmnet with OS bullseye [19:54:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:53] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudvirt1047.eqiad.wmnet - https://phabricator.wikimedia.org/T293391 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudvirt1047.eqiad.wmnet with OS b... [19:54:56] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1047.eqiad.wmnet with OS bullseye [19:54:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:01] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudvirt1047.eqiad.wmnet - https://phabricator.wikimedia.org/T293391 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudvirt1047.eqiad.wmnet with OS bulls... [19:57:04] ops-drmrs: drmrs power draw isn't evenly split - https://phabricator.wikimedia.org/T303468 (RobH) p:Triage→Medium [20:00:35] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host cloudvirt1047.mgmt.eqiad.wmnet with reboot policy FORCED [20:00:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:49] ops-drmrs: drmrs power draw isn't evenly split - https://phabricator.wikimedia.org/T303468 (RobH) These were setup prior to the new bios firmware automation script, but does that script set the PDU draw to even split? (@Papaul, do you know by chance?) 
[20:07:33] ops-drmrs: drmrs power draw isn't evenly split - https://phabricator.wikimedia.org/T303468 (Papaul) @rob yes just run the script with the options --no-dhcp --no-users [20:10:14] (CR) JHathaway: [V: +1 C: +2] profile::mirrors: move mirrors module into profiles [puppet] - https://gerrit.wikimedia.org/r/767889 (https://phabricator.wikimedia.org/T300985) (owner: JHathaway) [20:10:16] ops-drmrs: drmrs power draw isn't evenly split - https://phabricator.wikimedia.org/T303468 (RobH) >>! In T303468#7765502, @Papaul wrote: > @rob yes just run the script with the options --no-dhcp --no-users This is what I was hoping, thank you! [20:12:59] RECOVERY - SSH on analytics1067.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:17:37] (PS5) Jbond: C:varnish: create rate limit keyed on the cloud provider [puppet] - https://gerrit.wikimedia.org/r/769469 (https://phabricator.wikimedia.org/T270391) [20:17:39] (PS1) Jbond: C:varnish: use X-Public-Cloud to store the cloud provider [puppet] - https://gerrit.wikimedia.org/r/769511 [20:18:23] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:20:04] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt1047.mgmt.eqiad.wmnet with reboot policy FORCED [20:20:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:55] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1047.eqiad.wmnet with OS bullseye [20:20:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:01] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudvirt1047.eqiad.wmnet - https://phabricator.wikimedia.org/T293391 (ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudvirt1047.eqiad.wmnet with OS b... [20:21:04] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1047.eqiad.wmnet with OS bullseye [20:21:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudvirt1047.eqiad.wmnet - https://phabricator.wikimedia.org/T293391 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudvirt1047.eqiad.wmnet with OS bulls... [20:23:09] (03PS6) 10Jbond: C:varnish: create rate limit keyed on the cloud provider [puppet] - 10https://gerrit.wikimedia.org/r/769469 (https://phabricator.wikimedia.org/T270391) [20:29:56] (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetmaster::gitclone: small clean up patch [puppet] - 10https://gerrit.wikimedia.org/r/769445 (owner: 10Jbond) [20:31:18] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Bump changelong for including latest workflow_utils [debs/airflow] (debian) - 10https://gerrit.wikimedia.org/r/767830 (owner: 10Ottomata) [20:31:57] RECOVERY - SSH on aqs1009.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:33:51] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:33:56] (03CR) 10CDanis: [C: 03+1] "awesome thank you" [puppet] - 10https://gerrit.wikimedia.org/r/769492 (owner: 10Jbond) [20:35:03] (03PS3) 10JHathaway: profile::mirrros: switch to apache2 [puppet] - 10https://gerrit.wikimedia.org/r/767903 (https://phabricator.wikimedia.org/T300985) [20:38:12] (03CR) 10CDanis: [C: 03+1] C:external_clouds_vendors: follow up fixes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/769492 (owner: 10Jbond) [20:38:51] (03CR) 10CDanis: [C: 
03+1] C:varnish: Add the external_cloud_vendors module to the cache clusters [puppet] - 10https://gerrit.wikimedia.org/r/769132 (https://phabricator.wikimedia.org/T270391) (owner: 10Jbond) [20:39:51] (03PS4) 10JHathaway: profile::mirrros: switch to apache2 [puppet] - 10https://gerrit.wikimedia.org/r/767903 (https://phabricator.wikimedia.org/T300985) [20:40:20] (03CR) 10CDanis: [C: 03+1] "LGTM but of course I would roll out carefully (disable puppet, start with one host, make sure reload is successful)" [puppet] - 10https://gerrit.wikimedia.org/r/769464 (https://phabricator.wikimedia.org/T270391) (owner: 10Jbond) [20:40:52] (03CR) 10Reedy: [C: 03+2] Bump the cache version of Gadget [extensions/Gadgets] (wmf/1.38.0-wmf.25) - 10https://gerrit.wikimedia.org/r/769304 (https://phabricator.wikimedia.org/T303391) (owner: 10Zabe) [20:41:02] (03PS1) 10Reedy: Bump MediaWikiGadgetsDefinitionRepo cache version [extensions/Gadgets] (wmf/1.38.0-wmf.25) - 10https://gerrit.wikimedia.org/r/769305 (https://phabricator.wikimedia.org/T303455) [20:41:04] (03CR) 10JHathaway: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34178/console" [puppet] - 10https://gerrit.wikimedia.org/r/767903 (https://phabricator.wikimedia.org/T300985) (owner: 10JHathaway) [20:41:09] (03PS2) 10Reedy: Bump MediaWikiGadgetsDefinitionRepo cache version [extensions/Gadgets] (wmf/1.38.0-wmf.25) - 10https://gerrit.wikimedia.org/r/769305 (https://phabricator.wikimedia.org/T303455) [20:41:15] (03CR) 10Reedy: [C: 03+2] Bump MediaWikiGadgetsDefinitionRepo cache version [extensions/Gadgets] (wmf/1.38.0-wmf.25) - 10https://gerrit.wikimedia.org/r/769305 (https://phabricator.wikimedia.org/T303455) (owner: 10Reedy) [20:43:05] (03CR) 10Ayounsi: [C: 04-1] "That shouldn't be needed as the install servers have public IPs, and this filter only discards traffic to private IPs." 
[homer/public] - https://gerrit.wikimedia.org/r/769508 (https://phabricator.wikimedia.org/T303296) (owner: Majavah) [20:43:36] (CR) CDanis: O:external_clouds_vendors: New module for fetching cloud networks (1 comment) [puppet] - https://gerrit.wikimedia.org/r/769410 (https://phabricator.wikimedia.org/T270391) (owner: Jbond) [20:44:55] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 2 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (nskaggs) [20:45:52] (PS5) JHathaway: profile::mirrros: switch to apache2 [puppet] - https://gerrit.wikimedia.org/r/767903 (https://phabricator.wikimedia.org/T300985) [20:46:18] (PS1) Volans: requests: fix backward compatibility with urllib3 [software/pywmflib] - https://gerrit.wikimedia.org/r/769518 [20:46:50] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 2 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (nskaggs) @RobH I updated the task to call out these should be installed into two different rows, as well as not installing in WM... 
[20:46:53] (CR) JHathaway: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34179/console" [puppet] - https://gerrit.wikimedia.org/r/767903 (https://phabricator.wikimedia.org/T300985) (owner: JHathaway) [20:47:23] (CR) CDanis: [C: +1] C:varnish: update templates netmapper public clouds (1 comment) [puppet] - https://gerrit.wikimedia.org/r/769466 (owner: Jbond) [20:48:13] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation restart without plugin upgrade (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - ryankemper@cumin1001 - T301955 [20:48:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:16] T301955: Upgrade relforge to elasticsearch 6.8.23 - https://phabricator.wikimedia.org/T301955 [20:48:22] (PS6) JHathaway: profile::mirrros: switch to apache2 [puppet] - https://gerrit.wikimedia.org/r/767903 (https://phabricator.wikimedia.org/T300985) [20:48:43] (CR) Volans: "I had totally forgot we had to do the same in debmonitor (see Ib8d83c8fb0948176a8b2f154b921a8442550e3ef )" [software/pywmflib] - https://gerrit.wikimedia.org/r/769518 (owner: Volans) [20:49:01] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 2 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (nskaggs) Note, the existing machines are taking up a total of 12U (6U each) in D2 and A4. 
[20:49:21] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) restart without plugin upgrade (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - ryankemper@cumin1001 - T301955 [20:49:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:54] (03CR) 10JHathaway: [C: 03+2] profile::mirrros: switch to apache2 [puppet] - 10https://gerrit.wikimedia.org/r/767903 (https://phabricator.wikimedia.org/T300985) (owner: 10JHathaway) [20:50:21] (03CR) 10JHathaway: [C: 03+2] profile::mirrros: switch to apache2 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/767903 (https://phabricator.wikimedia.org/T300985) (owner: 10JHathaway) [20:51:15] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation restart without plugin upgrade (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - ryankemper@cumin1001 - T301955 [20:51:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:17] (03CR) 10Jbond: [C: 03+1] "😄" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/769518 (owner: 10Volans) [20:53:04] (03CR) 10Volans: [C: 03+2] requests: fix backward compatibility with urllib3 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/769518 (owner: 10Volans) [20:55:55] (03PS1) 10Volans: spicerack: make http_session more flexible [software/spicerack] - 10https://gerrit.wikimedia.org/r/769520 [20:56:16] (03PS2) 10Jbond: C:varnish: use X-Public-Cloud to store the cloud provider [puppet] - 10https://gerrit.wikimedia.org/r/769511 [20:56:26] (03Merged) 10jenkins-bot: requests: fix backward compatibility with urllib3 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/769518 (owner: 10Volans) [20:57:43] (03Merged) 10jenkins-bot: Bump the cache version of Gadget [extensions/Gadgets] (wmf/1.38.0-wmf.25) - 10https://gerrit.wikimedia.org/r/769304 (https://phabricator.wikimedia.org/T303391) (owner: 
Zabe) [20:58:48] (PS1) JHathaway: profile::mirrors: add headers module for apache2 [puppet] - https://gerrit.wikimedia.org/r/769521 (https://phabricator.wikimedia.org/T300985) [20:59:06] (PS1) Volans: CHANGELOG: add changelogs for release v1.1.1 [software/pywmflib] - https://gerrit.wikimedia.org/r/769522 [20:59:20] (CR) Volans: [C: +2] CHANGELOG: add changelogs for release v1.1.1 [software/pywmflib] - https://gerrit.wikimedia.org/r/769522 (owner: Volans) [21:00:05] RoanKattouw and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220309T2100). [21:00:05] cscott: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:18] (Merged) jenkins-bot: Bump MediaWikiGadgetsDefinitionRepo cache version [extensions/Gadgets] (wmf/1.38.0-wmf.25) - https://gerrit.wikimedia.org/r/769305 (https://phabricator.wikimedia.org/T303455) (owner: Reedy) [21:01:59] (CR) JHathaway: [C: +2] profile::mirrors: add headers module for apache2 [puppet] - https://gerrit.wikimedia.org/r/769521 (https://phabricator.wikimedia.org/T300985) (owner: JHathaway) [21:02:33] (Merged) jenkins-bot: CHANGELOG: add changelogs for release v1.1.1 [software/pywmflib] - https://gerrit.wikimedia.org/r/769522 (owner: Volans) [21:03:37] PROBLEM - Check systemd state on sretest1002 is CRITICAL: CRITICAL - degraded: The following units failed: dump_cloud_ip_ranges.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:04:44] (PS1) Volans: Upstream release v1.1.1 [software/pywmflib] (debian) - https://gerrit.wikimedia.org/r/769523 [21:05:23] (CR) Volans: [C: +2] Upstream release v1.1.1 [software/pywmflib] (debian) - https://gerrit.wikimedia.org/r/769523 (owner: 
10Volans) [21:06:06] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) restart without plugin upgrade (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - ryankemper@cumin1001 - T301955 [21:06:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:09] T301955: Upgrade relforge to elasticsearch 6.8.23 - https://phabricator.wikimedia.org/T301955 [21:06:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:06:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:21] RECOVERY - Check systemd state on sretest1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:07:00] (03CR) 10CDanis: O:external_clouds_vendors: New module for fetching cloud networks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/769410 (https://phabricator.wikimedia.org/T270391) (owner: 10Jbond) [21:07:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:07:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:07:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:08:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:26] (03Merged) 10jenkins-bot: Upstream release v1.1.1 [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/769523 (owner: 10Volans) [21:10:00] (03PS2) 10Jbond: C:external_clouds_vendors: follow up fixes [puppet] - 10https://gerrit.wikimedia.org/r/769492 [21:10:11] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation restart without plugin upgrade (1 nodes at 
a time) for ElasticSearch cluster relforge: relforge cluster restart - ryankemper@cumin1001 - T301955 [21:10:11] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) restart without plugin upgrade (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - ryankemper@cumin1001 - T301955 [21:10:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:16] (03CR) 10CDanis: [C: 03+2] C:external_clouds_vendors: follow up fixes [puppet] - 10https://gerrit.wikimedia.org/r/769492 (owner: 10Jbond) [21:20:56] (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [21:25:42] (03PS1) 10Volans: requests: fix backward compatibility with urllib3 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/769527 [21:33:37] (03CR) 10Volans: [C: 03+2] "I'll self-merge to unblock the release but please feel free to comment, I'll make a follow up patch if needed." [software/pywmflib] - 10https://gerrit.wikimedia.org/r/769527 (owner: 10Volans) [21:36:14] (03Merged) 10jenkins-bot: requests: fix backward compatibility with urllib3 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/769527 (owner: 10Volans) [21:37:18] 10SRE: mirrors.wikimedia.org debian repository fails to serve packages from time to time - https://phabricator.wikimedia.org/T300985 (10jhathaway) 05Open→03Resolved @aborrero the mirrors server has now been switched to apache2 and I am unable to reproduce the error with my tests. Please reopen if you experie... 
[21:38:01] (03PS1) 10Volans: CHANGELOG: add changelogs for release v1.1.2 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/769534 [21:38:16] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v1.1.2 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/769534 (owner: 10Volans) [21:38:43] (03CR) 10Cwhite: [C: 03+2] eventlogging: update grafana dashboard links [puppet] - 10https://gerrit.wikimedia.org/r/763827 (https://phabricator.wikimedia.org/T211982) (owner: 10Cwhite) [21:38:55] (03CR) 10Cwhite: [C: 03+2] hadoop: update grafana dashboard links [puppet] - 10https://gerrit.wikimedia.org/r/763831 (https://phabricator.wikimedia.org/T211982) (owner: 10Cwhite) [21:41:11] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v1.1.2 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/769534 (owner: 10Volans) [21:44:46] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T300775)', diff saved to https://phabricator.wikimedia.org/P22225 and previous config saved to /var/cache/conftool/dbconfig/20220309-214445-marostegui.json [21:44:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:50] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [21:45:06] (03PS1) 10Volans: Upstream release v1.1.2 [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/769535 [21:46:29] (03CR) 10Volans: [C: 03+2] Upstream release v1.1.2 [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/769535 (owner: 10Volans) [21:48:53] (03Merged) 10jenkins-bot: Upstream release v1.1.2 [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/769535 (owner: 10Volans) [21:50:22] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [21:50:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:55] (03PS1) 10Jbond: C:pupetmaster: add support for netbox-hiera git repo [puppet] - 
https://gerrit.wikimedia.org/r/769538 [21:53:18] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:53:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:09] !log uploaded python3-wmflib_1.1.2 to apt.wikimedia.org buster-wikimedia,bullseye-wikimedia [21:54:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:19] !log reedy@deploy1002 Synchronized php-1.38.0-wmf.25/extensions/Gadgets: T303455 (duration: 00m 50s) [21:57:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:22] T303455: TypeError: Argument 1 passed to MediaWiki\Extension\Gadgets\GadgetRepo::getGadgetDefinitionTitle() must be of the type string, null given, called in /srv/mediawiki/php-1.38.0-wmf.25/extensions/Gadgets/includes/SpecialGadgets.php on line 114 - https://phabricator.wikimedia.org/T303455 [21:57:29] (CR) Jbond: "fyi i saw you +2 this, not sure if that was international but feel free to merge if you wanted to do more testing, otherwise ill pick up t" [puppet] - https://gerrit.wikimedia.org/r/769492 (owner: Jbond) [21:59:13] (CR) Jbond: [C: +1] "seems reasonable" [software/spicerack] - https://gerrit.wikimedia.org/r/769520 (owner: Volans) [22:00:21] (PS2) Jbond: C:pupetmaster: add support for netbox-hiera git repo [puppet] - https://gerrit.wikimedia.org/r/769538 [22:00:21] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P22226 and previous config saved to /var/cache/conftool/dbconfig/20220309-220020-marostegui.json [22:00:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:39] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [22:04:59] (PS3) Jbond: C:pupetmaster: add support for 
netbox-hiera git repo [puppet] - https://gerrit.wikimedia.org/r/769538 [22:06:06] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34182/console" [puppet] - https://gerrit.wikimedia.org/r/769538 (owner: Jbond) [22:08:50] (CR) Jbond: "realising this is essentially a noop ill merge" [puppet] - https://gerrit.wikimedia.org/r/769492 (owner: Jbond) [22:15:56] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P22228 and previous config saved to /var/cache/conftool/dbconfig/20220309-221555-marostegui.json [22:15:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:33] PROBLEM - Check systemd state on sretest1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump_cloud_ip_ranges.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:21:06] (CR) Volans: [C: +2] spicerack: make http_session more flexible [software/spicerack] - https://gerrit.wikimedia.org/r/769520 (owner: Volans) [22:22:56] SRE, ConfirmEdit (CAPTCHA extension), Platform Engineering, Wikimedia-Site-requests, and 3 others: Allow Stewards to enable 'emergency CAPTCHAs' for anonymous IP edits - https://phabricator.wikimedia.org/T303433 (sbassett) >>! In T303433#7765884, @matmarex wrote: > (I'd like to note that the perm... 
[22:24:37] SRE, ConfirmEdit (CAPTCHA extension), Platform Engineering, Wikimedia-Site-requests, and 3 others: Allow Stewards to enable 'emergency CAPTCHAs' for anonymous IP edits - https://phabricator.wikimedia.org/T303433 (sbassett) [22:27:43] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for ShubhankarP - https://phabricator.wikimedia.org/T303032 (ShubhankarP) [22:29:36] (Merged) jenkins-bot: spicerack: make http_session more flexible [software/spicerack] - https://gerrit.wikimedia.org/r/769520 (owner: Volans) [22:31:31] !log marostegui@cumin2002 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T300775)', diff saved to https://phabricator.wikimedia.org/P22229 and previous config saved to /var/cache/conftool/dbconfig/20220309-223130-marostegui.json [22:31:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:35] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [22:33:01] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for ShubhankarP - https://phabricator.wikimedia.org/T303032 (ShubhankarP) [22:35:54] !log marostegui@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [22:35:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:35:56] !log marostegui@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [22:35:57] RECOVERY - Check systemd state on sretest1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:35:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:54:48] !log pt1979@cumin2002 START - Cookbook sre.hosts.dhcp for host cloudvirt1047.eqiad.wmnet [22:54:49] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.dhcp 
(exit_code=99) for host cloudvirt1047.eqiad.wmnet [22:54:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:54:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:58:38] (PS1) Jbond: external_cloud_vendors: some addtional follow up fixes [puppet] - https://gerrit.wikimedia.org/r/769572 [22:59:35] !log pt1979@cumin2002 START - Cookbook sre.hosts.dhcp for host cloudvirt1047.eqiad.wmnet [22:59:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:01] RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:00:07] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host cloudvirt1047.eqiad.wmnet [23:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:03:49] jouncebot now [23:03:49] No deployments scheduled for the next 1 hour(s) and 56 minute(s) [23:05:30] Rolling train forward to group1 again [23:05:56] (PS1) Ahmon Dancy: group1 wikis to 1.38.0-wmf.25 refs T300201 [mediawiki-config] - https://gerrit.wikimedia.org/r/769573 [23:05:58] (CR) Ahmon Dancy: [C: +2] group1 wikis to 1.38.0-wmf.25 refs T300201 [mediawiki-config] - https://gerrit.wikimedia.org/r/769573 (owner: Ahmon Dancy) [23:07:07] (Merged) jenkins-bot: group1 wikis to 1.38.0-wmf.25 refs T300201 [mediawiki-config] - https://gerrit.wikimedia.org/r/769573 (owner: Ahmon Dancy) [23:08:23] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.38.0-wmf.25 refs T300201 [23:08:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:08:27] T300201: 1.38.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T300201 [23:08:39] PROBLEM - Check systemd state on thanos-be1003 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:08:44] !log marostegui@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [23:08:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:08:47] !log marostegui@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [23:08:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:09:13] !log dancy@deploy1002 Synchronized php: group1 wikis to 1.38.0-wmf.25 refs T300201 (duration: 00m 49s) [23:09:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:39] so far so good... [23:14:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [23:14:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [23:16:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [23:16:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [23:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:36:17] PROBLEM - SSH on aqs1009.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:46:16] (CR) Ryan Kemper: elastic: relax & restore perms during upgrade (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/769109 (https://phabricator.wikimedia.org/T301955) (owner: Ryan Kemper) [23:57:20] (PS11) Jbond: C:varnish::common: Add documentation [puppet] - 
https://gerrit.wikimedia.org/r/768739