[00:00:15] RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:02:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T329203)', diff saved to https://phabricator.wikimedia.org/P45337 and previous config saved to /var/cache/conftool/dbconfig/20230308-000203-marostegui.json [00:02:10] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [00:05:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T328817)', diff saved to https://phabricator.wikimedia.org/P45338 and previous config saved to /var/cache/conftool/dbconfig/20230308-000516-marostegui.json [00:05:19] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2172.codfw.wmnet with reason: Maintenance [00:05:24] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [00:05:32] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2172.codfw.wmnet with reason: Maintenance [00:05:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2172 (T328817)', diff saved to https://phabricator.wikimedia.org/P45339 and previous config saved to /var/cache/conftool/dbconfig/20230308-000538-marostegui.json [00:10:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P45340 and previous config saved to /var/cache/conftool/dbconfig/20230308-001036-marostegui.json [00:14:31] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: dispatch-ldap-users-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:17:10] !log marostegui@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P45341 and previous config saved to /var/cache/conftool/dbconfig/20230308-001709-marostegui.json [00:17:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T328817)', diff saved to https://phabricator.wikimedia.org/P45342 and previous config saved to /var/cache/conftool/dbconfig/20230308-001734-marostegui.json [00:17:41] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [00:24:26] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ssingh) [00:24:31] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ssingh) [00:25:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P45343 and previous config saved to /var/cache/conftool/dbconfig/20230308-002543-marostegui.json [00:25:44] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ssingh) [00:25:57] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:29:37] !log brett@cumin2002 END (FAIL) - Cookbook sre.ganeti.reimage (exit_code=99) for host ncredir1002.eqiad.wmnet with OS bullseye [00:29:46] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by brett@cumin2002 for host ncredir1002.eqiad.wmnet with OS bullseye executed with errors: - ncredir1002 (**FAIL**) - Down... 
[00:31:13] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:32:16] !log brett@cumin2002 START - Cookbook sre.ganeti.reimage for host ncredir1002.eqiad.wmnet with OS bullseye [00:32:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P45344 and previous config saved to /var/cache/conftool/dbconfig/20230308-003216-marostegui.json [00:32:27] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by brett@cumin2002 for host ncredir1002.eqiad.wmnet with OS bullseye [00:32:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P45345 and previous config saved to /var/cache/conftool/dbconfig/20230308-003240-marostegui.json [00:40:37] RECOVERY - MegaRAID on an-worker1078 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [00:40:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T329260)', diff saved to https://phabricator.wikimedia.org/P45346 and previous config saved to /var/cache/conftool/dbconfig/20230308-004049-marostegui.json [00:40:52] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2126.codfw.wmnet with reason: Maintenance [00:40:57] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [00:41:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2126.codfw.wmnet with reason: Maintenance [00:41:07] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: 
Maintenance [00:41:09] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [00:41:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2126 (T329260)', diff saved to https://phabricator.wikimedia.org/P45347 and previous config saved to /var/cache/conftool/dbconfig/20230308-004115-marostegui.json [00:43:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T329260)', diff saved to https://phabricator.wikimedia.org/P45348 and previous config saved to /var/cache/conftool/dbconfig/20230308-004341-marostegui.json [00:47:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T329203)', diff saved to https://phabricator.wikimedia.org/P45349 and previous config saved to /var/cache/conftool/dbconfig/20230308-004722-marostegui.json [00:47:25] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2174.codfw.wmnet with reason: Maintenance [00:47:30] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [00:47:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2174.codfw.wmnet with reason: Maintenance [00:47:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2174 (T329203)', diff saved to https://phabricator.wikimedia.org/P45350 and previous config saved to /var/cache/conftool/dbconfig/20230308-004744-marostegui.json [00:47:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P45351 and previous config saved to /var/cache/conftool/dbconfig/20230308-004753-marostegui.json [00:51:51] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir1002.eqiad.wmnet with reason: host reimage [00:55:02] !log brett@cumin2002 END 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir1002.eqiad.wmnet with reason: host reimage [00:58:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P45352 and previous config saved to /var/cache/conftool/dbconfig/20230308-005848-marostegui.json [01:01:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T329203)', diff saved to https://phabricator.wikimedia.org/P45353 and previous config saved to /var/cache/conftool/dbconfig/20230308-010117-marostegui.json [01:01:26] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [01:03:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T328817)', diff saved to https://phabricator.wikimedia.org/P45354 and previous config saved to /var/cache/conftool/dbconfig/20230308-010300-marostegui.json [01:03:02] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2179.codfw.wmnet with reason: Maintenance [01:03:07] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [01:03:15] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2179.codfw.wmnet with reason: Maintenance [01:03:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2179 (T328817)', diff saved to https://phabricator.wikimedia.org/P45355 and previous config saved to /var/cache/conftool/dbconfig/20230308-010321-marostegui.json [01:08:12] !log brett@cumin2002 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host ncredir1002.eqiad.wmnet with OS bullseye [01:08:22] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by 
brett@cumin2002 for host ncredir1002.eqiad.wmnet with OS bullseye completed: - ncredir1002 (**PASS**) - Removed from Pu... [01:09:15] !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=ncredir1002.eqiad.wmnet [01:09:57] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (BCornwall) [01:13:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P45356 and previous config saved to /var/cache/conftool/dbconfig/20230308-011354-marostegui.json [01:14:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T328817)', diff saved to https://phabricator.wikimedia.org/P45357 and previous config saved to /var/cache/conftool/dbconfig/20230308-011401-marostegui.json [01:14:09] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [01:16:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P45358 and previous config saved to /var/cache/conftool/dbconfig/20230308-011624-marostegui.json [01:22:40] PROBLEM - MegaRAID on an-worker1078 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [01:29:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T329260)', diff saved to https://phabricator.wikimedia.org/P45359 and previous config saved to /var/cache/conftool/dbconfig/20230308-012901-marostegui.json [01:29:03] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2138.codfw.wmnet with reason: Maintenance 
[01:29:06] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2138.codfw.wmnet with reason: Maintenance [01:29:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P45360 and previous config saved to /var/cache/conftool/dbconfig/20230308-012908-marostegui.json [01:29:09] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [01:29:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2138:3312 (T329260)', diff saved to https://phabricator.wikimedia.org/P45361 and previous config saved to /var/cache/conftool/dbconfig/20230308-012918-marostegui.json [01:31:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P45362 and previous config saved to /var/cache/conftool/dbconfig/20230308-013131-marostegui.json [01:35:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T329260)', diff saved to https://phabricator.wikimedia.org/P45363 and previous config saved to /var/cache/conftool/dbconfig/20230308-013534-marostegui.json [01:35:41] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [01:44:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P45364 and previous config saved to /var/cache/conftool/dbconfig/20230308-014415-marostegui.json [01:46:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T329203)', diff saved to https://phabricator.wikimedia.org/P45365 and previous config saved to /var/cache/conftool/dbconfig/20230308-014637-marostegui.json [01:46:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2176.codfw.wmnet with reason: Maintenance [01:46:45] 
T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [01:46:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2176.codfw.wmnet with reason: Maintenance [01:47:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2176 (T329203)', diff saved to https://phabricator.wikimedia.org/P45366 and previous config saved to /var/cache/conftool/dbconfig/20230308-014659-marostegui.json [01:50:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P45367 and previous config saved to /var/cache/conftool/dbconfig/20230308-015040-marostegui.json [01:58:10] (CR) Cwhite: "I'm not aware of anything that needs mod_cache_disk on these hosts. I'm in favor of this or disabling it via Puppet." [puppet] - https://gerrit.wikimedia.org/r/895144 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff) [01:59:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T328817)', diff saved to https://phabricator.wikimedia.org/P45368 and previous config saved to /var/cache/conftool/dbconfig/20230308-015921-marostegui.json [01:59:29] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [02:00:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T329203)', diff saved to https://phabricator.wikimedia.org/P45369 and previous config saved to /var/cache/conftool/dbconfig/20230308-020016-marostegui.json [02:00:23] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [02:05:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P45370 and previous config 
saved to /var/cache/conftool/dbconfig/20230308-020547-marostegui.json [02:06:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:15:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P45371 and previous config saved to /var/cache/conftool/dbconfig/20230308-021523-marostegui.json [02:20:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T329260)', diff saved to https://phabricator.wikimedia.org/P45372 and previous config saved to /var/cache/conftool/dbconfig/20230308-022054-marostegui.json [02:20:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2148.codfw.wmnet with reason: Maintenance [02:21:02] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [02:21:10] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2148.codfw.wmnet with reason: Maintenance [02:21:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2148 (T329260)', diff saved to https://phabricator.wikimedia.org/P45373 and previous config saved to /var/cache/conftool/dbconfig/20230308-022116-marostegui.json [02:26:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:27:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T329260)', diff saved to https://phabricator.wikimedia.org/P45374 and previous config saved to 
/var/cache/conftool/dbconfig/20230308-022726-marostegui.json [02:27:35] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [02:30:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P45375 and previous config saved to /var/cache/conftool/dbconfig/20230308-023029-marostegui.json [02:32:15] SRE, API Platform, GrowthExperiments-ImpactModule, Growth-Team (Current Sprint), MW-1.40-notes (1.40.0-wmf.21; 2023-01-30): UserImpact: Fetch information for more articles when calculating most-viewed-articles data point - https://phabricator.wikimedia.org/T324675 (Tgr) The user impact module... [02:42:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P45376 and previous config saved to /var/cache/conftool/dbconfig/20230308-024233-marostegui.json [02:45:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T329203)', diff saved to https://phabricator.wikimedia.org/P45377 and previous config saved to /var/cache/conftool/dbconfig/20230308-024536-marostegui.json [02:45:44] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [02:57:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P45378 and previous config saved to /var/cache/conftool/dbconfig/20230308-025739-marostegui.json [03:12:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T329260)', diff saved to https://phabricator.wikimedia.org/P45379 and previous config saved to /var/cache/conftool/dbconfig/20230308-031246-marostegui.json [03:12:48] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2170.codfw.wmnet with reason: 
Maintenance [03:12:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2170.codfw.wmnet with reason: Maintenance [03:12:54] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [03:12:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2170:3312 (T329260)', diff saved to https://phabricator.wikimedia.org/P45380 and previous config saved to /var/cache/conftool/dbconfig/20230308-031257-marostegui.json [03:19:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T329260)', diff saved to https://phabricator.wikimedia.org/P45381 and previous config saved to /var/cache/conftool/dbconfig/20230308-031910-marostegui.json [03:19:18] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [03:34:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to https://phabricator.wikimedia.org/P45382 and previous config saved to /var/cache/conftool/dbconfig/20230308-033416-marostegui.json [03:49:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to https://phabricator.wikimedia.org/P45383 and previous config saved to /var/cache/conftool/dbconfig/20230308-034923-marostegui.json [03:51:28] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for David Martin - https://phabricator.wikimedia.org/T331500 (DMartin-WMF) [04:04:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T329260)', diff saved to https://phabricator.wikimedia.org/P45384 and previous config saved to /var/cache/conftool/dbconfig/20230308-040430-marostegui.json [04:04:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2175.codfw.wmnet with reason: Maintenance [04:04:38] T329260: Drop cuc_comment from cu_changes in wmf 
wikis - https://phabricator.wikimedia.org/T329260 [04:04:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2175.codfw.wmnet with reason: Maintenance [04:04:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2175 (T329260)', diff saved to https://phabricator.wikimedia.org/P45385 and previous config saved to /var/cache/conftool/dbconfig/20230308-040451-marostegui.json [04:15:24] RECOVERY - MegaRAID on an-worker1078 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [04:47:52] PROBLEM - MegaRAID on an-worker1078 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [05:05:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T329260)', diff saved to https://phabricator.wikimedia.org/P45386 and previous config saved to /var/cache/conftool/dbconfig/20230308-050517-marostegui.json [05:05:26] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [05:20:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P45387 and previous config saved to /var/cache/conftool/dbconfig/20230308-052024-marostegui.json [05:35:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P45388 and previous config saved to /var/cache/conftool/dbconfig/20230308-053531-marostegui.json [05:50:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T329260)', diff saved to https://phabricator.wikimedia.org/P45389 and 
previous config saved to /var/cache/conftool/dbconfig/20230308-055038-marostegui.json [05:50:45] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [06:08:31] SRE-swift-storage, MediaWiki-File-management, Unstewarded-production-error: `Filebackend::Multiwrite`, multi-dc and thumbnail handling - https://phabricator.wikimedia.org/T331138 (Joe) >>! In T331138#8664269, @MatthewVernon wrote: > From a Data Persistence POV, thumbs are ephemeral / cached - we are... [06:09:25] SRE-swift-storage, MediaWiki-File-management, Unstewarded-production-error: `Filebackend::Multiwrite`, multi-dc and thumbnail handling - https://phabricator.wikimedia.org/T331138 (Joe) Also: if pre-generation of thumbs makes sense (does it? do we have any numbers on this stuff?) then it should happen... [06:10:30] SRE-swift-storage, MediaWiki-File-management, Unstewarded-production-error: `Filebackend::Multiwrite`, multi-dc and thumbnail handling - https://phabricator.wikimedia.org/T331138 (Joe) @thcipriani what does the #unstewarded-production-error tag mean, in practice? Is there a process to get someone to... [06:12:59] (PS1) DLynch: Release DiscussionTools on mobile on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/895375 (https://phabricator.wikimedia.org/T328942) [06:18:58] (PS1) DLynch: Switch order of "Add topic" and language dropdown [skins/Vector] (wmf/1.40.0-wmf.26) - https://gerrit.wikimedia.org/r/895297 (https://phabricator.wikimedia.org/T267444) [06:20:01] SRE, Wikidata, Wikidata-Query-Service, Wikidata.org, and 2 others: Depooled servers may still be taken into account for query service maxlag - https://phabricator.wikimedia.org/T331405 (Joe) >>! In T331405#8672360, @dcausse wrote: >>>! In T331405#8672341, @Joe wrote: >> Updates shouldn't depend... 
[06:31:16] PROBLEM - Check systemd state on arclamp2001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_apache2-htcacheclean.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:47:08] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2112.codfw.wmnet with reason: Maintenance [06:47:21] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2112.codfw.wmnet with reason: Maintenance [06:47:53] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db[2134,2160].codfw.wmnet,db[1101,1117,1159].eqiad.wmnet with reason: m3 master switchover T331384 [06:47:59] T331384: Switchover m3 master db1159 -> db1101 - https://phabricator.wikimedia.org/T331384 [06:48:09] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2134,2160].codfw.wmnet,db[1101,1117,1159].eqiad.wmnet with reason: m3 master switchover T331384 [06:48:16] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db[2134,2160].codfw.wmnet,db[1101,1117,1159].eqiad.wmnet with reason: m3 master switchover T331387 [06:48:21] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2134,2160].codfw.wmnet,db[1101,1117,1159].eqiad.wmnet with reason: m3 master switchover T331387 [06:48:22] T331387: Switchover m3 master db1101 -> db1159 - https://phabricator.wikimedia.org/T331387 [06:48:40] (PS1) Marostegui: Revert "mariadb: Promote db1101 to m3 master" [puppet] - https://gerrit.wikimedia.org/r/895299 [06:48:46] (PS2) Marostegui: Revert "mariadb: Promote db1101 to m3 master" [puppet] - https://gerrit.wikimedia.org/r/895299 [06:49:55] Going to switchover phabricator master - phabricator will be on read only for around 1 minute [06:52:21] (CR) Marostegui: [C: +2] Revert "mariadb: Promote db1101 to m3 master" [puppet] - 
https://gerrit.wikimedia.org/r/895299 (owner: Marostegui) [06:53:42] !log Failover m3 from db1101 to db1159 - T331387 [06:53:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:53:49] T331387: Switchover m3 master db1101 -> db1159 - https://phabricator.wikimedia.org/T331387 [06:56:09] SRE, DBA, Data-Engineering-Planning, Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (Marostegui) [06:57:28] (PS1) Marostegui: wmnet: Failover m2-master [dns] - https://gerrit.wikimedia.org/r/895394 (https://phabricator.wikimedia.org/T330165) [06:59:41] (CR) Marostegui: [C: +2] wmnet: Failover m2-master [dns] - https://gerrit.wikimedia.org/r/895394 (https://phabricator.wikimedia.org/T330165) (owner: Marostegui) [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230308T0700) [07:00:33] SRE, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (Marostegui) m1-master and m2-master proxies failed over [07:01:13] SRE, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (Marostegui) [07:03:02] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1162.eqiad.wmnet with reason: Maintenance [07:03:16] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1162.eqiad.wmnet with reason: Maintenance [07:04:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2175.codfw.wmnet with reason: Maintenance [07:04:52] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2175.codfw.wmnet with reason: Maintenance [07:04:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling 
db2175 (T329260)', diff saved to https://phabricator.wikimedia.org/P45390 and previous config saved to /var/cache/conftool/dbconfig/20230308-070458-marostegui.json [07:05:05] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [07:05:08] !log root@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 33 hosts with reason: Primary switchover s8 T330991 [07:05:14] T330991: Switchover s8 master (db1109 -> db1126) - https://phabricator.wikimedia.org/T330991 [07:05:31] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 33 hosts with reason: Primary switchover s8 T330991 [07:05:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db1126 with weight 0 T330991', diff saved to https://phabricator.wikimedia.org/P45391 and previous config saved to /var/cache/conftool/dbconfig/20230308-070544-root.json [07:06:48] (PS2) Marostegui: mariadb: Promote db1126 to s8 master [puppet] - https://gerrit.wikimedia.org/r/893431 (https://phabricator.wikimedia.org/T330991) (owner: Gerrit maintenance bot) [07:07:39] RECOVERY - MegaRAID on an-worker1078 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [07:08:43] (CR) Marostegui: [C: +2] mariadb: Promote db1126 to s8 master [puppet] - https://gerrit.wikimedia.org/r/893431 (https://phabricator.wikimedia.org/T330991) (owner: Gerrit maintenance bot) [07:21:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2175 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P45392 and previous config saved to /var/cache/conftool/dbconfig/20230308-072128-root.json [07:22:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2107.codfw.wmnet with reason: Maintenance [07:22:18] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2107.codfw.wmnet with reason: Maintenance 
[07:29:13] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1110.eqiad.wmnet with reason: Maintenance [07:29:26] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1110.eqiad.wmnet with reason: Maintenance [07:29:28] !log Starting s8 eqiad failover from db1109 to db1126 - T330991 [07:29:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T329260)', diff saved to https://phabricator.wikimedia.org/P45393 and previous config saved to /var/cache/conftool/dbconfig/20230308-072932-marostegui.json [07:29:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:36] T330991: Switchover s8 master (db1109 -> db1126) - https://phabricator.wikimedia.org/T330991 [07:29:42] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [07:30:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db1126 to s8 primary T330991', diff saved to https://phabricator.wikimedia.org/P45394 and previous config saved to /var/cache/conftool/dbconfig/20230308-073005-root.json [07:31:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1109 T330991', diff saved to https://phabricator.wikimedia.org/P45395 and previous config saved to /var/cache/conftool/dbconfig/20230308-073110-root.json [07:32:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T329260)', diff saved to https://phabricator.wikimedia.org/P45396 and previous config saved to /var/cache/conftool/dbconfig/20230308-073228-marostegui.json [07:36:12] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2140.codfw.wmnet with reason: Maintenance [07:36:25] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2140.codfw.wmnet with reason: Maintenance [07:36:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2175 (re)pooling @ 5%: Repooling', diff 
saved to https://phabricator.wikimedia.org/P45397 and previous config saved to /var/cache/conftool/dbconfig/20230308-073633-root.json [07:42:53] !log taavi@deploy2002 Started deploy [horizon/deploy@9d02cd6]: updating wmf-sudo-dashboard [07:44:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1109', diff saved to https://phabricator.wikimedia.org/P45398 and previous config saved to /var/cache/conftool/dbconfig/20230308-074427-marostegui.json [07:44:46] (03CR) 10Nicolas Fraison: [C: 03+2] hadoop: decrease log retention from 40d to 14d [puppet] - 10https://gerrit.wikimedia.org/r/894481 (owner: 10Nicolas Fraison) [07:47:10] (03CR) 10Elukey: "Left a couple of comments, plus I have another one - I didn't get why the mesh config's version was bumped to 1.1.0 in this case, could yo" [deployment-charts] - 10https://gerrit.wikimedia.org/r/895336 (https://phabricator.wikimedia.org/T326729) (owner: 10JMeybohm) [07:47:19] (03CR) 10Muehlenhoff: logstash: Enable profile::auto_restarts::service for apache2-htcacheclean (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/895144 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [07:47:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P45399 and previous config saved to /var/cache/conftool/dbconfig/20230308-074735-marostegui.json [07:47:36] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2097.codfw.wmnet with reason: Maintenance [07:47:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2097.codfw.wmnet with reason: Maintenance [07:47:50] !log taavi@deploy2002 Finished deploy [horizon/deploy@9d02cd6]: updating wmf-sudo-dashboard (duration: 04m 56s) [07:47:57] (03CR) 10Elukey: [C: 03+1] "\o/" [labs/private] - 10https://gerrit.wikimedia.org/r/895237 (https://phabricator.wikimedia.org/T329717) (owner: 10JMeybohm) [07:49:29] (03PS2) 
10Alexandros Kosiaris: WikiKube eqiad: Remove the old IP space [homer/public] - 10https://gerrit.wikimedia.org/r/890805 (https://phabricator.wikimedia.org/T326617) [07:49:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2099.codfw.wmnet with reason: Maintenance [07:50:10] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2099.codfw.wmnet with reason: Maintenance [07:50:27] (03CR) 10Alexandros Kosiaris: [C: 03+2] WikiKube eqiad: Remove the old IP space [homer/public] - 10https://gerrit.wikimedia.org/r/890805 (https://phabricator.wikimedia.org/T326617) (owner: 10Alexandros Kosiaris) [07:50:37] PROBLEM - MegaRAID on an-worker1078 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [07:51:10] (03Merged) 10jenkins-bot: WikiKube eqiad: Remove the old IP space [homer/public] - 10https://gerrit.wikimedia.org/r/890805 (https://phabricator.wikimedia.org/T326617) (owner: 10Alexandros Kosiaris) [07:51:22] (03PS1) 10Slyngshede: P:IDM enable debug logging [puppet] - 10https://gerrit.wikimedia.org/r/895663 [07:51:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2175 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P45400 and previous config saved to /var/cache/conftool/dbconfig/20230308-075139-root.json [07:52:49] (03CR) 10Slyngshede: [C: 03+2] P:IDM enable debug logging [puppet] - 10https://gerrit.wikimedia.org/r/895663 (owner: 10Slyngshede) [07:55:11] (03CR) 10JMeybohm: Remove the .Values.kubernetesApi hack (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/895336 (https://phabricator.wikimedia.org/T326729) (owner: 10JMeybohm) [07:56:05] !log marostegui@cumin1001 START - 
Cookbook sre.hosts.downtime for 12:00:00 on db2102.codfw.wmnet with reason: Maintenance [07:56:18] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2102.codfw.wmnet with reason: Maintenance [07:58:38] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2106.codfw.wmnet with reason: Maintenance [07:58:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2106.codfw.wmnet with reason: Maintenance [07:58:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2106 (T329203)', diff saved to https://phabricator.wikimedia.org/P45401 and previous config saved to /var/cache/conftool/dbconfig/20230308-075857-marostegui.json [07:59:04] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [08:00:05] Amir1 and Urbanecm: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230308T0800). [08:00:05] No Gerrit patches in the queue for this window AFAICS. 
[08:00:07] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] secrets/ssl: Remove keys for kubernetes etcd clusters [labs/private] - 10https://gerrit.wikimedia.org/r/895237 (https://phabricator.wikimedia.org/T329717) (owner: 10JMeybohm) [08:01:11] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 22 hosts with reason: Schema change [08:01:27] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 22 hosts with reason: Schema change [08:01:34] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 20 hosts with reason: Schema change [08:01:48] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 20 hosts with reason: Schema change [08:02:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P45402 and previous config saved to /var/cache/conftool/dbconfig/20230308-080241-marostegui.json [08:02:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db2093.codfw.wmnet [08:02:46] (03CR) 10Filippo Giunchedi: [C: 03+1] alertmanager: match also FQDN [software/spicerack] - 10https://gerrit.wikimedia.org/r/895364 (owner: 10Volans) [08:02:47] PROBLEM - Check systemd state on grafana1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-var-lib-grafana.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:03:48] (03PS1) 10Marostegui: mariadb: Decommission db2093 [puppet] - 10https://gerrit.wikimedia.org/r/895664 (https://phabricator.wikimedia.org/T330827) [08:04:12] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2103.codfw.wmnet with reason: Maintenance [08:04:26] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2103.codfw.wmnet with reason: Maintenance [08:04:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling 
db2103 (T328817)', diff saved to https://phabricator.wikimedia.org/P45403 and previous config saved to /var/cache/conftool/dbconfig/20230308-080431-marostegui.json [08:04:38] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [08:06:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2175 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P45404 and previous config saved to /var/cache/conftool/dbconfig/20230308-080644-root.json [08:07:15] !log marostegui@cumin1001 START - Cookbook sre.dns.netbox [08:07:37] (03CR) 10Marostegui: [C: 03+2] mariadb: Decommission db2093 [puppet] - 10https://gerrit.wikimedia.org/r/895664 (https://phabricator.wikimedia.org/T330827) (owner: 10Marostegui) [08:08:42] (03PS1) 10Alexandros Kosiaris: Remove old wikikube IP spaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/895686 (https://phabricator.wikimedia.org/T326617) [08:09:21] !log marostegui@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2093.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1001" [08:10:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106 (T329203)', diff saved to https://phabricator.wikimedia.org/P45405 and previous config saved to /var/cache/conftool/dbconfig/20230308-081027-marostegui.json [08:10:34] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [08:10:35] !log marostegui@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2093.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1001" [08:10:36] !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:10:36] !log 
marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2093.codfw.wmnet [08:10:43] RECOVERY - Check systemd state on grafana1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:11:09] 10ops-codfw, 10DBA, 10decommission-hardware: decommission db2093.codfw.wmnet - https://phabricator.wikimedia.org/T330827 (10Marostegui) This is ready for DC-Ops [08:11:14] 10ops-codfw, 10DBA, 10decommission-hardware: decommission db2093.codfw.wmnet - https://phabricator.wikimedia.org/T330827 (10Marostegui) a:05Marostegui→03None [08:11:27] 10ops-codfw, 10Data-Persistence (work done), 10decommission-hardware: decommission db2093.codfw.wmnet - https://phabricator.wikimedia.org/T330827 (10Marostegui) [08:12:08] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install db218[567] - https://phabricator.wikimedia.org/T326342 (10Marostegui) [08:12:20] (03CR) 10Muehlenhoff: [C: 03+1] "Updated results look good, glad to finally see this automated, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/894729 (https://phabricator.wikimedia.org/T277011) (owner: 10JHathaway) [08:12:31] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 15 hosts with reason: Schema change [08:12:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 15 hosts with reason: Schema change [08:14:59] 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10MoritzMuehlenhoff) [08:15:11] !log Deploy schema change on s1 eqiad dbmaint T329260 [08:15:14] !log Deploy schema change on s4 eqiad dbmaint T329260 [08:15:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:17] !log Deploy schema change on s7 eqiad dbmaint T329260 [08:15:17] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [08:15:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:47] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 19 hosts with reason: Schema change [08:15:48] !log Deploy schema change on s8 eqiad dbmaint T329260 [08:15:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:12] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 19 hosts with reason: Schema change [08:16:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103 (T328817)', diff saved to https://phabricator.wikimedia.org/P45406 and previous config saved to /var/cache/conftool/dbconfig/20230308-081614-marostegui.json [08:16:22] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [08:17:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling 
after maintenance db1110 (T329260)', diff saved to https://phabricator.wikimedia.org/P45407 and previous config saved to /var/cache/conftool/dbconfig/20230308-081748-marostegui.json [08:17:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1113.eqiad.wmnet with reason: Maintenance [08:18:04] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1113.eqiad.wmnet with reason: Maintenance [08:18:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T329260)', diff saved to https://phabricator.wikimedia.org/P45408 and previous config saved to /var/cache/conftool/dbconfig/20230308-081809-marostegui.json [08:19:01] (03CR) 10Nicolas Fraison: [C: 03+2] hadoop: set quota init threads to speed up failover [puppet] - 10https://gerrit.wikimedia.org/r/895127 (https://phabricator.wikimedia.org/T310293) (owner: 10Nicolas Fraison) [08:19:21] (03PS1) 10Alexandros Kosiaris: Remove old wikikube IP spaces [puppet] - 10https://gerrit.wikimedia.org/r/895687 (https://phabricator.wikimedia.org/T326617) [08:19:23] (03PS1) 10Alexandros Kosiaris: m5: Remove deprecated? 
toolhub wikikube grants [puppet] - 10https://gerrit.wikimedia.org/r/895688 (https://phabricator.wikimedia.org/T326617) [08:21:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T329260)', diff saved to https://phabricator.wikimedia.org/P45409 and previous config saved to /var/cache/conftool/dbconfig/20230308-082112-marostegui.json [08:21:19] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [08:21:35] (03PS1) 10Nicolas Fraison: Revert "hadoop: set quota init threads to speed up failover" [puppet] - 10https://gerrit.wikimedia.org/r/895689 [08:21:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2175 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P45410 and previous config saved to /var/cache/conftool/dbconfig/20230308-082149-root.json [08:22:40] (03CR) 10Nicolas Fraison: [C: 03+2] Revert "hadoop: set quota init threads to speed up failover" [puppet] - 10https://gerrit.wikimedia.org/r/895689 (owner: 10Nicolas Fraison) [08:22:44] (03CR) 10Alexandros Kosiaris: [C: 03+2] Remove old wikikube IP spaces [puppet] - 10https://gerrit.wikimedia.org/r/895687 (https://phabricator.wikimedia.org/T326617) (owner: 10Alexandros Kosiaris) [08:24:33] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01616 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [08:25:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106', diff saved to https://phabricator.wikimedia.org/P45411 and previous config saved to /var/cache/conftool/dbconfig/20230308-082533-marostegui.json [08:29:09] (03PS1) 10Nicolas Fraison: hadoop: set quota init threads to speed up failover" [puppet] - 10https://gerrit.wikimedia.org/r/895690 [08:29:15] (03CR) 10Marostegui: [C: 03+1] "This requires manual deletion from the database. 
Happy to do it once you've merged it." [puppet] - 10https://gerrit.wikimedia.org/r/895688 (https://phabricator.wikimedia.org/T326617) (owner: 10Alexandros Kosiaris) [08:30:21] (03PS2) 10Nicolas Fraison: hadoop: set quota init threads to speed up failover [puppet] - 10https://gerrit.wikimedia.org/r/895690 [08:31:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103', diff saved to https://phabricator.wikimedia.org/P45412 and previous config saved to /var/cache/conftool/dbconfig/20230308-083121-marostegui.json [08:32:50] !log Deploy schema change on s5 eqiad dbmaint T329260 [08:32:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:56] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [08:32:59] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 14 hosts with reason: Schema change [08:33:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 14 hosts with reason: Schema change [08:33:49] PROBLEM - Check systemd state on kubemaster1002 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:34:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 15 hosts with reason: Schema change [08:34:02] !log Deploy schema change on s3 eqiad dbmaint T329260 [08:34:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:18] (03CR) 10Alexandros Kosiaris: [C: 03+2] m5: Remove deprecated? toolhub wikikube grants [puppet] - 10https://gerrit.wikimedia.org/r/895688 (https://phabricator.wikimedia.org/T326617) (owner: 10Alexandros Kosiaris) [08:34:23] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 15 hosts with reason: Schema change [08:34:32] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks! 
Merging right now" [puppet] - 10https://gerrit.wikimedia.org/r/895688 (https://phabricator.wikimedia.org/T326617) (owner: 10Alexandros Kosiaris) [08:36:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P45413 and previous config saved to /var/cache/conftool/dbconfig/20230308-083618-marostegui.json [08:36:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2175 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P45414 and previous config saved to /var/cache/conftool/dbconfig/20230308-083654-root.json [08:37:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3315 (re)pooling @ 15%: Repooling', diff saved to https://phabricator.wikimedia.org/P45415 and previous config saved to /var/cache/conftool/dbconfig/20230308-083731-root.json [08:37:54] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2101.codfw.wmnet with reason: Maintenance [08:38:07] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2101.codfw.wmnet with reason: Maintenance [08:38:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1113:3315', diff saved to https://phabricator.wikimedia.org/P45416 and previous config saved to /var/cache/conftool/dbconfig/20230308-083843-marostegui.json [08:40:34] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2111.codfw.wmnet with reason: Maintenance [08:40:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106', diff saved to https://phabricator.wikimedia.org/P45417 and previous config saved to /var/cache/conftool/dbconfig/20230308-084040-marostegui.json [08:40:48] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2111.codfw.wmnet with reason: Maintenance [08:40:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2111 
(T329260)', diff saved to https://phabricator.wikimedia.org/P45418 and previous config saved to /var/cache/conftool/dbconfig/20230308-084053-marostegui.json [08:41:01] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [08:41:16] !log Deploy schema change on s3 eqiad dbmaint T329203 [08:41:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:22] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [08:42:03] (03CR) 10Alexandros Kosiaris: [C: 03+2] Remove old wikikube IP spaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/895686 (https://phabricator.wikimedia.org/T326617) (owner: 10Alexandros Kosiaris) [08:42:58] (03CR) 10Giuseppe Lavagetto: "As I wrote in the comments, I don't think the code I wrote for resolving with an Edns client subnet is good enough for wmflib.dns as it's " [software/spicerack] - 10https://gerrit.wikimedia.org/r/894655 (owner: 10Giuseppe Lavagetto) [08:44:50] 10SRE, 10DBA, 10Toolhub, 10Wikimedia-Mailing-lists, 10cloud-services-team: Switchover m5 master (db1183 -> db1176) - https://phabricator.wikimedia.org/T330847 (10Marostegui) 05Stalled→03Open [08:45:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T329260)', diff saved to https://phabricator.wikimedia.org/P45419 and previous config saved to /var/cache/conftool/dbconfig/20230308-084525-marostegui.json [08:46:18] (03PS1) 10Vgutierrez: hiera: Enable HAProxy systemd hardening in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/895692 (https://phabricator.wikimedia.org/T323944) [08:46:24] (03Merged) 10jenkins-bot: Remove old wikikube IP spaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/895686 (https://phabricator.wikimedia.org/T326617) (owner: 10Alexandros Kosiaris) [08:46:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103', diff 
saved to https://phabricator.wikimedia.org/P45420 and previous config saved to /var/cache/conftool/dbconfig/20230308-084628-marostegui.json [08:47:13] (03CR) 10Vgutierrez: [C: 03+2] hiera: Enable HAProxy systemd hardening in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/895692 (https://phabricator.wikimedia.org/T323944) (owner: 10Vgutierrez) [08:49:12] !log installing git security updates [08:49:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:03] !log re-enable HAProxy systemd service unit hardening in ulsfo - T323944 [08:50:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:09] T323944: haproxy: work on systemd unit hardening (cp hosts) - https://phabricator.wikimedia.org/T323944 [08:51:56] (03CR) 10Nicolas Fraison: [C: 03+2] hadoop: automate refresh of exclude nodes in NN and RM [puppet] - 10https://gerrit.wikimedia.org/r/893999 (https://phabricator.wikimedia.org/T330982) (owner: 10Nicolas Fraison) [08:52:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2175 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P45421 and previous config saved to /var/cache/conftool/dbconfig/20230308-085159-root.json [08:52:36] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [08:53:14] !log remove 10.64.64.0/21 and 10.192.64.0/21 from calico GlobalNetworkPolicies T326617 [08:53:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:21] T326617: Decide on new Pod and Sevice IPv4 ranges for wikikube clusters - https://phabricator.wikimedia.org/T326617 [08:53:23] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[08:55:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106 (T329203)', diff saved to https://phabricator.wikimedia.org/P45422 and previous config saved to /var/cache/conftool/dbconfig/20230308-085546-marostegui.json [08:55:49] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2110.codfw.wmnet with reason: Maintenance [08:55:54] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [08:56:03] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2110.codfw.wmnet with reason: Maintenance [08:56:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2110 (T329203)', diff saved to https://phabricator.wikimedia.org/P45423 and previous config saved to /var/cache/conftool/dbconfig/20230308-085608-marostegui.json [08:58:55] PROBLEM - Check whether ferm is active by checking the default input chain on kubemaster1002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [08:59:56] 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10Marostegui) [09:00:05] jeena and jnuche: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230308T0900). 
[09:00:24] 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10Marostegui) [09:00:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P45424 and previous config saved to /var/cache/conftool/dbconfig/20230308-090031-marostegui.json [09:00:42] 10SRE, 10DBA, 10Toolhub, 10Wikimedia-Mailing-lists, 10cloud-services-team: Switchover m5 master (db1183 -> db1176) - https://phabricator.wikimedia.org/T330847 (10Marostegui) [09:01:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103 (T328817)', diff saved to https://phabricator.wikimedia.org/P45425 and previous config saved to /var/cache/conftool/dbconfig/20230308-090134-marostegui.json [09:01:37] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2116.codfw.wmnet with reason: Maintenance [09:01:42] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [09:01:50] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2116.codfw.wmnet with reason: Maintenance [09:01:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2116 (T328817)', diff saved to https://phabricator.wikimedia.org/P45426 and previous config saved to /var/cache/conftool/dbconfig/20230308-090156-marostegui.json [09:02:00] 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10Marostegui) [09:05:46] (03PS1) 10Marostegui: mariadb: Move db1101 to m1 [puppet] - 10https://gerrit.wikimedia.org/r/895693 (https://phabricator.wikimedia.org/T331511) [09:07:21] PROBLEM - haproxy failover on dbproxy1014 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [09:07:40] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110 (T329203)', diff saved to https://phabricator.wikimedia.org/P45428 and previous config saved to /var/cache/conftool/dbconfig/20230308-090739-marostegui.json [09:07:47] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [09:08:37] PROBLEM - haproxy failover on dbproxy1012 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [09:08:43] (03CR) 10Marostegui: [C: 03+2] mariadb: Move db1101 to m1 [puppet] - 10https://gerrit.wikimedia.org/r/895693 (https://phabricator.wikimedia.org/T331511) (owner: 10Marostegui) [09:09:40] (03PS1) 10JMeybohm: admin_ng/cert-manager: Remove dependency on kubernetesMasters.cidrs [deployment-charts] - 10https://gerrit.wikimedia.org/r/895696 (https://phabricator.wikimedia.org/T287491) [09:10:43] haproxy alerts are to be expected [09:12:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T328817)', diff saved to https://phabricator.wikimedia.org/P45429 and previous config saved to /var/cache/conftool/dbconfig/20230308-091223-marostegui.json [09:12:30] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [09:15:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P45430 and previous config saved to /var/cache/conftool/dbconfig/20230308-091538-marostegui.json [09:19:14] 10SRE, 10Beta-Cluster-Infrastructure, 10Patch-For-Review, 10Release-Engineering-Team (Radar): git: detected dubious ownership in repository at '/srv/mediawiki-staging' - https://phabricator.wikimedia.org/T325128 (10hashar) Part of the fix for the deployment servers and `scap` is https://gerrit.wikimedia.or... 
[09:22:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110', diff saved to https://phabricator.wikimedia.org/P45431 and previous config saved to /var/cache/conftool/dbconfig/20230308-092246-marostegui.json [09:24:22] (03PS2) 10Jelto: gitlab: production host needs additional flag for restore [puppet] - 10https://gerrit.wikimedia.org/r/895310 (https://phabricator.wikimedia.org/T331295) [09:25:35] RECOVERY - Check systemd state on dumpsdata1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:26:26] (03PS1) 10Volans: Use builtins GenericAlias objects for type hints [software/spicerack] - 10https://gerrit.wikimedia.org/r/895699 [09:26:28] (03PS1) 10Volans: Use collections.abc GenericAlias for type hints [software/spicerack] - 10https://gerrit.wikimedia.org/r/895700 [09:27:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P45432 and previous config saved to /var/cache/conftool/dbconfig/20230308-092729-marostegui.json [09:29:11] (03CR) 10Elukey: [C: 03+1] Remove the .Values.kubernetesApi hack (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/895336 (https://phabricator.wikimedia.org/T326729) (owner: 10JMeybohm) [09:29:17] (03CR) 10Btullis: [C: 03+1] "lgtm, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/895690 (owner: 10Nicolas Fraison) [09:29:40] (03CR) 10EoghanGaffney: [C: 03+1] gitlab: production host needs additional flag for restore [puppet] - 10https://gerrit.wikimedia.org/r/895310 (https://phabricator.wikimedia.org/T331295) (owner: 10Jelto) [09:30:30] !log drain ganeti1011 for eventual reimage to Bullseye T311687 [09:30:33] (03PS1) 10Nicolas Fraison: hiveserver: update max metaspace setting in test cluster [puppet] - 10https://gerrit.wikimedia.org/r/895701 [09:30:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:36] T311687: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 [09:30:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T329260)', diff saved to https://phabricator.wikimedia.org/P45433 and previous config saved to /var/cache/conftool/dbconfig/20230308-093045-marostegui.json [09:30:47] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2123.codfw.wmnet with reason: Maintenance [09:30:52] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [09:30:55] (03CR) 10Btullis: [C: 03+1] sre.hadoop: do not override API method [cookbooks] - 10https://gerrit.wikimedia.org/r/895206 (owner: 10Volans) [09:30:58] (03PS2) 10Nicolas Fraison: hiveserver: update max metaspace setting in test cluster [puppet] - 10https://gerrit.wikimedia.org/r/895701 (https://phabricator.wikimedia.org/T303168) [09:31:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2123.codfw.wmnet with reason: Maintenance [09:31:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2123 (T329260)', diff saved to https://phabricator.wikimedia.org/P45434 and previous config saved to /var/cache/conftool/dbconfig/20230308-093106-marostegui.json [09:31:42] (03CR) 10Nicolas Fraison: [C: 03+2] hiveserver: update max metaspace 
setting in test cluster [puppet] - 10https://gerrit.wikimedia.org/r/895701 (https://phabricator.wikimedia.org/T303168) (owner: 10Nicolas Fraison) [09:32:42] (03CR) 10Btullis: [C: 03+1] Add SPDX headers to additional DE profiles [puppet] - 10https://gerrit.wikimedia.org/r/890000 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [09:34:08] (03PS1) 10Vgutierrez: hiera: Enable ESI testing in cp5024 [puppet] - 10https://gerrit.wikimedia.org/r/895702 (https://phabricator.wikimedia.org/T308799) [09:34:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T329260)', diff saved to https://phabricator.wikimedia.org/P45435 and previous config saved to /var/cache/conftool/dbconfig/20230308-093424-marostegui.json [09:35:43] (03CR) 10JMeybohm: [C: 03+2] Remove the .Values.kubernetesApi hack [deployment-charts] - 10https://gerrit.wikimedia.org/r/895336 (https://phabricator.wikimedia.org/T326729) (owner: 10JMeybohm) [09:36:07] (03CR) 10Jelto: [C: 03+2] gitlab: production host needs additional flag for restore [puppet] - 10https://gerrit.wikimedia.org/r/895310 (https://phabricator.wikimedia.org/T331295) (owner: 10Jelto) [09:37:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110', diff saved to https://phabricator.wikimedia.org/P45436 and previous config saved to /var/cache/conftool/dbconfig/20230308-093752-marostegui.json [09:39:54] (03CR) 10Muehlenhoff: [C: 03+2] Add SPDX headers to additional DE profiles [puppet] - 10https://gerrit.wikimedia.org/r/890000 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [09:40:07] (03Merged) 10jenkins-bot: Remove the .Values.kubernetesApi hack [deployment-charts] - 10https://gerrit.wikimedia.org/r/895336 (https://phabricator.wikimedia.org/T326729) (owner: 10JMeybohm) [09:41:07] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.005877 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [09:41:42] !log jayme@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [09:41:47] !log jayme@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [09:42:27] !log jayme@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. [09:42:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P45437 and previous config saved to /var/cache/conftool/dbconfig/20230308-094236-marostegui.json [09:43:01] !log jayme@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. [09:43:19] (03PS31) 10Btullis: Configure the new ceph servers with mon and mgr daemons [puppet] - 10https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) [09:43:40] (03CR) 10Clément Goubert: [V: 03+2 C: 03+2] nodejs16: Add /bin/nodejs symlink [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/894687 (owner: 10Clément Goubert) [09:44:50] (03CR) 10Nicolas Fraison: [C: 03+2] hadoop: set quota init threads to speed up failover [puppet] - 10https://gerrit.wikimedia.org/r/895690 (owner: 10Nicolas Fraison) [09:45:18] !log Rebuilding production-images for 894687 [09:45:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:55] (03PS1) 10JMeybohm: flink-session-cluster: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/895704 (https://phabricator.wikimedia.org/T326729) [09:49:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P45438 and previous config saved to /var/cache/conftool/dbconfig/20230308-094931-marostegui.json [09:50:29] (03PS2) 10Muehlenhoff: logstash: Stop apache2-htcacheclean.service via Puppet [puppet] - 10https://gerrit.wikimedia.org/r/895144 [09:52:51] (03PS3) 10Giuseppe 
Lavagetto: Add check_dns_state to service.Service [software/spicerack] - 10https://gerrit.wikimedia.org/r/894655 [09:53:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110 (T329203)', diff saved to https://phabricator.wikimedia.org/P45439 and previous config saved to /var/cache/conftool/dbconfig/20230308-095259-marostegui.json [09:53:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2119.codfw.wmnet with reason: Maintenance [09:53:07] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [09:53:15] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2119.codfw.wmnet with reason: Maintenance [09:53:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2119 (T329203)', diff saved to https://phabricator.wikimedia.org/P45440 and previous config saved to /var/cache/conftool/dbconfig/20230308-095320-marostegui.json [09:56:07] 10SRE-swift-storage, 10Commons, 10Wikimedia-production-error: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10MatthewVernon) That looks like a DNS error? unless I'm misreading `net::ERR_NAME_NOT_RESOLVED` which was... 
[09:57:38] (03CR) 10Effie Mouzeli: [C: 03+2] Add kubernetes102[3,4] to the wikikube-eqiad cluster 1/3 [puppet] - 10https://gerrit.wikimedia.org/r/894697 (https://phabricator.wikimedia.org/T313874) (owner: 10Effie Mouzeli) [09:57:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T328817)', diff saved to https://phabricator.wikimedia.org/P45441 and previous config saved to /var/cache/conftool/dbconfig/20230308-095742-marostegui.json [09:57:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2130.codfw.wmnet with reason: Maintenance [09:57:45] (03PS3) 10Effie Mouzeli: Add kubernetes102[3,4] to the wikikube-eqiad cluster 1/3 [puppet] - 10https://gerrit.wikimedia.org/r/894697 (https://phabricator.wikimedia.org/T313874) [09:57:50] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [09:57:58] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2130.codfw.wmnet with reason: Maintenance [09:58:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2130 (T328817)', diff saved to https://phabricator.wikimedia.org/P45442 and previous config saved to /var/cache/conftool/dbconfig/20230308-095804-marostegui.json [09:59:24] 10SRE-swift-storage, 10MediaWiki-File-management, 10Unstewarded-production-error: `Filebackend::Multiwrite`, multi-dc and thumbnail handling - https://phabricator.wikimedia.org/T331138 (10MatthewVernon) >>! In T331138#8675244, @Joe wrote: >>>! In T331138#8664269, @MatthewVernon wrote: > >> From a Data Persi... [09:59:27] 10SRE, 10vm-requests: eqiad and codfw: 1 VM each requested for wikikube-staging - https://phabricator.wikimedia.org/T329940 (10jijiki) a:05JMeybohm→03jijiki [09:59:54] (03CR) 10JMeybohm: [C: 04-1] "If you're up to it you could also move away from kubernetesMasters.cidrs as we have calico 3.23 now. 
See https://phabricator.wikimedia.org" [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [10:01:09] 10SRE-swift-storage, 10MediaWiki-File-management, 10Unstewarded-production-error: `Filebackend::Multiwrite`, multi-dc and thumbnail handling - https://phabricator.wikimedia.org/T331138 (10MatthewVernon) >>! In T331138#8675245, @Joe wrote: > Also: if pre-generation of thumbs makes sense (does it? do we have a... [10:02:47] 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10Clement_Goubert) [10:02:59] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Clement_Goubert) [10:03:09] (03CR) 10Vgutierrez: [C: 03+2] hiera: Enable ESI testing in cp5024 [puppet] - 10https://gerrit.wikimedia.org/r/895702 (https://phabricator.wikimedia.org/T308799) (owner: 10Vgutierrez) [10:03:55] (03PS1) 10Elukey: ml-services: upgrade docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/895706 (https://phabricator.wikimedia.org/T329032) [10:04:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P45443 and previous config saved to /var/cache/conftool/dbconfig/20230308-100437-marostegui.json [10:04:42] 10SRE-swift-storage, 10MediaWiki-File-management, 10Unstewarded-production-error: `Filebackend::Multiwrite`, multi-dc and thumbnail handling - https://phabricator.wikimedia.org/T331138 (10MatthewVernon) One further thought - it would be nice if we could take swift's special 404-handler out of the equation, a... 
[10:05:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119 (T329203)', diff saved to https://phabricator.wikimedia.org/P45444 and previous config saved to /var/cache/conftool/dbconfig/20230308-100502-marostegui.json [10:05:09] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [10:08:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T328817)', diff saved to https://phabricator.wikimedia.org/P45445 and previous config saved to /var/cache/conftool/dbconfig/20230308-100826-marostegui.json [10:08:35] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [10:09:41] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for David Martin - https://phabricator.wikimedia.org/T331500 (10MatthewVernon) [10:09:58] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (POST certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:10:24] (03CR) 10JMeybohm: [C: 03+2] flink-session-cluster: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/895704 (https://phabricator.wikimedia.org/T326729) (owner: 10JMeybohm) [10:13:43] (03CR) 10Kevin Bazira: [C: 03+1] ml-services: upgrade docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/895706 (https://phabricator.wikimedia.org/T329032) (owner: 10Elukey) [10:14:42] (03Merged) 10jenkins-bot: flink-session-cluster: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/895704 (https://phabricator.wikimedia.org/T326729) (owner: 10JMeybohm) [10:14:58] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST certificaterequests) on 
k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:15:26] 10SRE-swift-storage, 10Commons, 10Wikimedia-production-error: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10Yann) Again `00537: FAILED: internal_api_error_UploadChunkFileException: [30779e28-6cca-4162-bd86-459bb... [10:16:36] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for David Martin - https://phabricator.wikimedia.org/T331500 (10MatthewVernon) @DMartin-WMF can I confirm you don't require kerberos access (you didn't explicitly ask for it; cf https://wikitech.wikimedia.org/wiki/Analytics/Data_acc... [10:17:08] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Nicholas Ifeajika - https://phabricator.wikimedia.org/T331277 (10Miriam) Thank you @matthewvernon! [10:17:22] (03CR) 10Jbond: [C: 03+1] alertmanager: match also FQDN [software/spicerack] - 10https://gerrit.wikimedia.org/r/895364 (owner: 10Volans) [10:17:23] 10SRE-swift-storage, 10Commons, 10Wikimedia-production-error: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10Yann) >>! In T328872#8675696, @MatthewVernon wrote: > That looks like a DNS error? unless I'm misreading... [10:17:34] (03CR) 10Ottomata: "Thanks! LGTM! One post nit about a comment." 
[puppet] - 10https://gerrit.wikimedia.org/r/894740 (https://phabricator.wikimedia.org/T327970) (owner: 10Bking) [10:19:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T329260)', diff saved to https://phabricator.wikimedia.org/P45446 and previous config saved to /var/cache/conftool/dbconfig/20230308-101944-marostegui.json [10:19:47] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2128.codfw.wmnet with reason: Maintenance [10:19:52] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [10:20:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2128.codfw.wmnet with reason: Maintenance [10:20:02] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [10:20:04] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [10:20:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119', diff saved to https://phabricator.wikimedia.org/P45447 and previous config saved to /var/cache/conftool/dbconfig/20230308-102009-marostegui.json [10:21:28] (03PS4) 10Hnowlan: service, k8s: add service configuration for AQS2 service device-analytics [puppet] - 10https://gerrit.wikimedia.org/r/889960 (https://phabricator.wikimedia.org/T320967) [10:22:40] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for David Martin - https://phabricator.wikimedia.org/T331500 (10Ottomata) Approved. 
[10:23:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T329260)', diff saved to https://phabricator.wikimedia.org/P45448 and previous config saved to /var/cache/conftool/dbconfig/20230308-102326-marostegui.json [10:23:30] 10SRE, 10LDAP-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (10MatthewVernon) @FNavas-foundation can I double-check what access you need for what purposes, please? You say you need access to turnilo - that can be done with just `w... [10:23:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P45449 and previous config saved to /var/cache/conftool/dbconfig/20230308-102334-marostegui.json [10:23:58] (KubernetesCalicoDown) firing: kubernetes1024.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=kubernetes1024.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:25:13] (03CR) 10Ilias Sarantopoulos: [C: 03+1] ml-services: upgrade docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/895706 (https://phabricator.wikimedia.org/T329032) (owner: 10Elukey) [10:25:58] (KubernetesRsyslogDown) firing: rsyslog on kubernetes1024:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes1024 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:27:57] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for s-mukuti - https://phabricator.wikimedia.org/T331402 (10MatthewVernon) @S_Mukuti I think this is a request to join the `wmf` LDAP group only? Also, can you double-check the wikitech username, please? I can't find an account by that name. 
[10:28:26] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [10:28:58] (KubernetesCalicoDown) firing: (2) kubernetes1023.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:30:58] (KubernetesRsyslogDown) resolved: rsyslog on kubernetes1024:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes1024 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:35:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119', diff saved to https://phabricator.wikimedia.org/P45450 and previous config saved to /var/cache/conftool/dbconfig/20230308-103515-marostegui.json [10:38:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P45451 and previous config saved to /var/cache/conftool/dbconfig/20230308-103833-marostegui.json [10:38:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P45452 and previous config saved to /var/cache/conftool/dbconfig/20230308-103840-marostegui.json [10:38:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST events) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:39:41] (03CR) 10Effie Mouzeli: [C: 03+2] Add kubernetes102[3,4] to the wikikube-eqiad cluster 2/3 [puppet] - 10https://gerrit.wikimedia.org/r/894700 (https://phabricator.wikimedia.org/T313874) (owner: 10Effie Mouzeli) [10:40:02] !log jayme@deploy2002 helmfile [staging] DONE 
helmfile.d/services/rdf-streaming-updater: apply [10:42:57] !log otto@deploy2002 Started deploy [analytics/refinery@eb29334]: Regular analytics weekly train [analytics/refinery@eb29334] [10:43:07] (03CR) 10Elukey: [C: 03+2] ml-services: upgrade docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/895706 (https://phabricator.wikimedia.org/T329032) (owner: 10Elukey) [10:43:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST events) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:44:58] 10SRE, 10Infrastructure-Foundations: IDM milestone 2 "Initial limited deployment" - https://phabricator.wikimedia.org/T320603 (10SLyngshede-WMF) [10:45:01] 10SRE, 10Infrastructure-Foundations: Extend LDAP to allow storing all necessary attributes - https://phabricator.wikimedia.org/T320794 (10SLyngshede-WMF) 05Open→03In progress p:05Triage→03Low [10:48:57] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [10:49:40] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [10:50:03] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
[10:50:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119 (T329203)', diff saved to https://phabricator.wikimedia.org/P45453 and previous config saved to /var/cache/conftool/dbconfig/20230308-105022-marostegui.json [10:50:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2136.codfw.wmnet with reason: Maintenance [10:50:29] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [10:50:31] (03PS2) 10Jbond: sre.__init__.py: update minor formating nits [cookbooks] - 10https://gerrit.wikimedia.org/r/849544 [10:50:33] (03PS9) 10Jbond: sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services [cookbooks] - 10https://gerrit.wikimedia.org/r/849130 [10:50:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2136.codfw.wmnet with reason: Maintenance [10:50:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2136 (T329203)', diff saved to https://phabricator.wikimedia.org/P45454 and previous config saved to /var/cache/conftool/dbconfig/20230308-105043-marostegui.json [10:50:55] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [10:51:18] !log otto@deploy2002 Finished deploy [analytics/refinery@eb29334]: Regular analytics weekly train [analytics/refinery@eb29334] (duration: 08m 20s) [10:51:34] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [10:51:50] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [10:52:03] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . 
[10:52:14] (03PS19) 10Jbond: sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - 10https://gerrit.wikimedia.org/r/849093 [10:52:18] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [10:52:22] (03PS7) 10Jbond: sre.netbox.restart-reboot: create a reboot book for netbox [cookbooks] - 10https://gerrit.wikimedia.org/r/849135 [10:52:31] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [10:52:37] (03CR) 10Jbond: "ready for review" [cookbooks] - 10https://gerrit.wikimedia.org/r/849130 (owner: 10Jbond) [10:52:50] (03PS1) 10Filippo Giunchedi: o11y: update thanos sidecar alerts [alerts] - 10https://gerrit.wikimedia.org/r/895709 (https://phabricator.wikimedia.org/T309182) [10:53:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P45455 and previous config saved to /var/cache/conftool/dbconfig/20230308-105339-marostegui.json [10:53:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T328817)', diff saved to https://phabricator.wikimedia.org/P45456 and previous config saved to /var/cache/conftool/dbconfig/20230308-105347-marostegui.json [10:53:49] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2141.codfw.wmnet with reason: Maintenance [10:53:54] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [10:53:58] (03CR) 10CI reject: [V: 04-1] sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - 10https://gerrit.wikimedia.org/r/849093 (owner: 10Jbond) [10:54:03] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2141.codfw.wmnet with reason: Maintenance [10:54:17] (03CR) 10CI reject: 
[V: 04-1] sre.netbox.restart-reboot: create a reboot book for netbox [cookbooks] - 10https://gerrit.wikimedia.org/r/849135 (owner: 10Jbond) [10:54:37] (03PS1) 10Volans: setup.py: bump dependencies minimum version [software/spicerack] - 10https://gerrit.wikimedia.org/r/895710 [10:54:39] (03PS1) 10Volans: setup.py: remove upper limit for prospector [software/spicerack] - 10https://gerrit.wikimedia.org/r/895711 [10:54:51] (03CR) 10Volans: "replies inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/894655 (owner: 10Giuseppe Lavagetto) [10:55:06] (03CR) 10Filippo Giunchedi: [C: 03+2] o11y: update thanos sidecar alerts [alerts] - 10https://gerrit.wikimedia.org/r/895709 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [10:55:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:57:59] 10SRE, 10Infrastructure-Foundations, 10netops: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (10cmooney) [10:58:10] 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10cmooney) 05Open→03Resolved [10:58:22] (03PS9) 10Jbond: SREBatchBase: Make action method a bit more dynamic [cookbooks] - 10https://gerrit.wikimedia.org/r/806170 [10:58:26] (03CR) 10Jbond: "This one is good to go" [cookbooks] - 10https://gerrit.wikimedia.org/r/806170 (owner: 10Jbond) [10:59:17] (03Abandoned) 10Jbond: Add MySQL upgrade cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/749195 (https://phabricator.wikimedia.org/T239814) (owner: 10Jbond) [10:59:28] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Run 2x1G links from asw-b1-codfw to cloudsw1-b1-codfw - 
https://phabricator.wikimedia.org/T331470 (10cmooney) >>! In T331470#8674700, @Jhancock.wm wrote: > I've made the patches with some changes. Port 46 on cloudsw1-b1-codfw is already configured... [10:59:55] (03Abandoned) 10Jbond: cookbook sre.misc-clusters.apt: [cookbooks] - 10https://gerrit.wikimedia.org/r/656139 (owner: 10Jbond) [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230308T1100) [11:00:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:01:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T329203)', diff saved to https://phabricator.wikimedia.org/P45457 and previous config saved to /var/cache/conftool/dbconfig/20230308-110121-marostegui.json [11:01:28] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [11:02:09] (03CR) 10Nicolas Fraison: [V: 03+1 C: 03+1] analytics::refinery::job::eventlogging_to_druid: Default to deploy-mode cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/895228 (owner: 10Mforns) [11:02:47] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2145.codfw.wmnet with reason: Maintenance [11:03:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2145.codfw.wmnet with reason: Maintenance [11:03:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2145 (T328817)', diff saved to https://phabricator.wikimedia.org/P45458 and previous config saved to /var/cache/conftool/dbconfig/20230308-110306-marostegui.json [11:03:13] T328817: Drop cuc_user and cuc_user_text from 
cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [11:04:15] (03PS2) 10Effie Mouzeli: Add kubernetes102[3,4] to the wikikube-eqiad cluster 2/3 [puppet] - 10https://gerrit.wikimedia.org/r/894700 (https://phabricator.wikimedia.org/T313874) [11:05:17] (03Abandoned) 10Jbond: sre:SREBatchBase: Wrap everything in icinga downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/749527 (owner: 10Jbond) [11:05:34] (03CR) 10Alexandros Kosiaris: [C: 04-1] Add kubernetes102[3,4] to the wikikube-eqiad cluster 2/3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/894700 (https://phabricator.wikimedia.org/T313874) (owner: 10Effie Mouzeli) [11:05:36] (03PS2) 10Effie Mouzeli: Add kubernetes102[3,4] to the wikikube-eqiad cluster 3/3 [puppet] - 10https://gerrit.wikimedia.org/r/894701 (https://phabricator.wikimedia.org/T313874) [11:05:59] (03Abandoned) 10Jbond: WIP) sre.apt.audit: produce a report of manually packages [cookbooks] - 10https://gerrit.wikimedia.org/r/657877 (owner: 10Jbond) [11:06:33] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/895358 (owner: 10Volans) [11:06:40] (03PS1) 10Filippo Giunchedi: alertmanager: highlight 'source' label [puppet] - 10https://gerrit.wikimedia.org/r/895713 [11:06:57] (03PS4) 10Hnowlan: Add service records for device-analytics. 
[dns] - 10https://gerrit.wikimedia.org/r/890398 (https://phabricator.wikimedia.org/T320967) [11:07:08] (03CR) 10Volans: [C: 03+2] homer: increase default timeout to 60s [puppet] - 10https://gerrit.wikimedia.org/r/895358 (owner: 10Volans) [11:07:43] (03PS3) 10Effie Mouzeli: Add kubernetes102[3,4] to the wikikube-eqiad cluster 2/3 [puppet] - 10https://gerrit.wikimedia.org/r/894700 (https://phabricator.wikimedia.org/T313874) [11:08:42] (03CR) 10Alexandros Kosiaris: [C: 03+1] Add kubernetes102[3,4] to the wikikube-eqiad cluster 2/3 [puppet] - 10https://gerrit.wikimedia.org/r/894700 (https://phabricator.wikimedia.org/T313874) (owner: 10Effie Mouzeli) [11:08:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T329260)', diff saved to https://phabricator.wikimedia.org/P45459 and previous config saved to /var/cache/conftool/dbconfig/20230308-110846-marostegui.json [11:08:48] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2137.codfw.wmnet with reason: Maintenance [11:08:53] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [11:09:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2137.codfw.wmnet with reason: Maintenance [11:09:05] (03PS4) 10Effie Mouzeli: Add kubernetes102[3,4] to the wikikube-eqiad cluster 2/3 [puppet] - 10https://gerrit.wikimedia.org/r/894700 (https://phabricator.wikimedia.org/T313874) [11:09:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2137:3315 (T329260)', diff saved to https://phabricator.wikimedia.org/P45460 and previous config saved to /var/cache/conftool/dbconfig/20230308-110907-marostegui.json [11:09:28] (03PS1) 10MVernon: add lwatson to ldap_only_users (for wmf) [puppet] - 10https://gerrit.wikimedia.org/r/895714 (https://phabricator.wikimedia.org/T331370) [11:09:37] (03CR) 10Effie Mouzeli: [C: 03+2] Add kubernetes102[3,4] to the wikikube-eqiad cluster 2/3 
[puppet] - https://gerrit.wikimedia.org/r/894700 (https://phabricator.wikimedia.org/T313874) (owner: Effie Mouzeli) [11:10:40] (CR) Volans: [C: +2] sre.hadoop: do not override API method [cookbooks] - https://gerrit.wikimedia.org/r/895206 (owner: Volans) [11:10:42] (PS38) Btullis: Add a spark-operator chart and helmfile configuration [deployment-charts] - https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) [11:10:53] RECOVERY - MegaRAID on an-worker1078 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:11:00] (CR) Volans: [C: +2] sre.{ganeti,hardware,hosts}: fix mypy issues [cookbooks] - https://gerrit.wikimedia.org/r/895207 (owner: Volans) [11:11:03] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/895714 (https://phabricator.wikimedia.org/T331370) (owner: MVernon) [11:11:22] (CR) MVernon: [C: +2] add lwatson to ldap_only_users (for wmf) [puppet] - https://gerrit.wikimedia.org/r/895714 (https://phabricator.wikimedia.org/T331370) (owner: MVernon) [11:11:32] (CR) Alexandros Kosiaris: [C: +2] profile::kubernetes::client: Switch to k8s 1.23 [puppet] - https://gerrit.wikimedia.org/r/895138 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm) [11:11:40] (CR) CI reject: [V: -1] Add a spark-operator chart and helmfile configuration [deployment-charts] - https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: Btullis) [11:12:33] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:12:34] (Merged) jenkins-bot: sre.hadoop: do not override API method [cookbooks] - https://gerrit.wikimedia.org/r/895206 (owner: Volans) [11:12:39] (CR) Btullis: Add a spark-operator chart and helmfile configuration (3 comments) 
[deployment-charts] - https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: Btullis) [11:12:46] (Merged) jenkins-bot: sre.{ganeti,hardware,hosts}: fix mypy issues [cookbooks] - https://gerrit.wikimedia.org/r/895207 (owner: Volans) [11:13:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T328817)', diff saved to https://phabricator.wikimedia.org/P45461 and previous config saved to /var/cache/conftool/dbconfig/20230308-111344-marostegui.json [11:13:51] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [11:13:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T329260)', diff saved to https://phabricator.wikimedia.org/P45462 and previous config saved to /var/cache/conftool/dbconfig/20230308-111355-marostegui.json [11:14:02] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [11:14:18] SRE, LDAP-Access-Requests, Patch-For-Review: Request access to the group ldap/wmf - https://phabricator.wikimedia.org/T331370 (MatthewVernon) Open→Resolved a:MatthewVernon @lwatson this is now done. [11:14:20] (CR) Btullis: Add a spark-operator chart and helmfile configuration (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: Btullis) [11:14:22] (PS5) Hnowlan: Add service records for device-analytics. [dns] - https://gerrit.wikimedia.org/r/890398 (https://phabricator.wikimedia.org/T320967) [11:14:46] vgutierrez: ok for me to go ahead with repooling eqiad at traffic layer? 
https://gerrit.wikimedia.org/r/c/operations/dns/+/894559 [11:15:30] (03PS5) 10Hnowlan: service, k8s: add service configuration for AQS2 service device-analytics [puppet] - 10https://gerrit.wikimedia.org/r/889960 (https://phabricator.wikimedia.org/T320967) [11:15:52] (03CR) 10Vgutierrez: [C: 03+1] Revert "traffic: Depool eqiad from user traffic for switchover" [dns] - 10https://gerrit.wikimedia.org/r/894559 (https://phabricator.wikimedia.org/T331285) (owner: 10Clément Goubert) [11:16:01] claime: yep [11:16:22] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/849544 (owner: 10Jbond) [11:16:25] RECOVERY - haproxy failover on dbproxy1014 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [11:16:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P45463 and previous config saved to /var/cache/conftool/dbconfig/20230308-111628-marostegui.json [11:16:35] RECOVERY - haproxy failover on dbproxy1012 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [11:17:16] vgutierrez: Great, proceeding then [11:17:41] (03PS4) 10Clément Goubert: Revert "traffic: Depool eqiad from user traffic for switchover" [dns] - 10https://gerrit.wikimedia.org/r/894559 (https://phabricator.wikimedia.org/T331285) [11:17:53] (03CR) 10Clément Goubert: Revert "traffic: Depool eqiad from user traffic for switchover" [dns] - 10https://gerrit.wikimedia.org/r/894559 (https://phabricator.wikimedia.org/T331285) (owner: 10Clément Goubert) [11:18:08] (03CR) 10Hnowlan: [C: 03+1] "lgtm, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/894542 (owner: 10Jaime Nuche) [11:18:49] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for David Martin - https://phabricator.wikimedia.org/T331500 (10MatthewVernon) [11:18:58] (KubernetesCalicoDown) resolved: (2) kubernetes1023.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [11:19:02] (03PS6) 10Jbond: profile::confd: add a confd profile [puppet] - 10https://gerrit.wikimedia.org/r/893468 (https://phabricator.wikimedia.org/T330849) [11:19:55] (03PS39) 10Btullis: Add a spark-operator chart and helmfile configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) [11:20:03] (03CR) 10Jbond: [C: 03+2] sre.__init__.py: update minor formating nits [cookbooks] - 10https://gerrit.wikimedia.org/r/849544 (owner: 10Jbond) [11:20:30] (03PS1) 10Hnowlan: conftool-data: add device-analytics service [puppet] - 10https://gerrit.wikimedia.org/r/895716 (https://phabricator.wikimedia.org/T320967) [11:20:47] (03CR) 10Clément Goubert: [C: 03+2] Revert "traffic: Depool eqiad from user traffic for switchover" [dns] - 10https://gerrit.wikimedia.org/r/894559 (https://phabricator.wikimedia.org/T331285) (owner: 10Clément Goubert) [11:20:49] (03CR) 10CI reject: [V: 04-1] Add a spark-operator chart and helmfile configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [11:21:10] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:21:33] !log Traffic: repool eqiad for user traffic - T331285 
[11:21:33] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for s-mukuti - https://phabricator.wikimedia.org/T331402 (10MatthewVernon) ...though there is a "Sarah Mukuti" wikitech user, which I think is correct? [11:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:39] T331285: March 2023 Traffic Repool checklist - https://phabricator.wikimedia.org/T331285 [11:21:57] (03Merged) 10jenkins-bot: sre.__init__.py: update minor formating nits [cookbooks] - 10https://gerrit.wikimedia.org/r/849544 (owner: 10Jbond) [11:22:41] (03PS1) 10Filippo Giunchedi: dispatch/grafana: retry GETs too on LDAP sync [puppet] - 10https://gerrit.wikimedia.org/r/895719 [11:22:50] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 4 NOOP 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40004/console" [puppet] - 10https://gerrit.wikimedia.org/r/893468 (https://phabricator.wikimedia.org/T330849) (owner: 10Jbond) [11:22:53] (03PS7) 10Jbond: profile::confd: add a confd profile [puppet] - 10https://gerrit.wikimedia.org/r/893468 (https://phabricator.wikimedia.org/T330849) [11:23:02] !log Traffic: authdns updated successfully for eqiad repool - T331285 [11:23:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:14] (03CR) 10Filippo Giunchedi: "I noticed the dispatch ldap sync failed due to 500s on GET" [puppet] - 10https://gerrit.wikimedia.org/r/895719 (owner: 10Filippo Giunchedi) [11:23:23] (03PS40) 10Btullis: Add a spark-operator chart and helmfile configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) [11:23:33] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/806170 (owner: 10Jbond) [11:23:36] (03Abandoned) 10Muehlenhoff: Enable command_broadcast to the new puppetdb 7 hosts [puppet] - 10https://gerrit.wikimedia.org/r/889817 (https://phabricator.wikimedia.org/T321783) (owner: 10Muehlenhoff) [11:23:46] 
(03CR) 10Effie Mouzeli: [C: 03+2] Add kubernetes102[3,4] to the wikikube-eqiad cluster 3/3 [puppet] - 10https://gerrit.wikimedia.org/r/894701 (https://phabricator.wikimedia.org/T313874) (owner: 10Effie Mouzeli) [11:23:59] (03PS3) 10Effie Mouzeli: Add kubernetes102[3,4] to the wikikube-eqiad cluster 3/3 [puppet] - 10https://gerrit.wikimedia.org/r/894701 (https://phabricator.wikimedia.org/T313874) [11:24:17] (03CR) 10CI reject: [V: 04-1] Add a spark-operator chart and helmfile configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [11:24:54] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:25:35] !log jmm@cumin2002 START - Cookbook sre.ganeti.reimage for host urldownloader1003.wikimedia.org with OS bullseye [11:25:41] 10SRE, 10Infrastructure-Foundations: Migrate the URL downloaders to Bullseye - https://phabricator.wikimedia.org/T329945 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by jmm@cumin2002 for host urldownloader1003.wikimedia.org with OS bullseye [11:26:08] !log T307943 upgrade kubernetes-client on deploy1002 deploy2002 [11:26:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:14] T307943: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 [11:26:17] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 5 NOOP 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40005/console" [puppet] - 10https://gerrit.wikimedia.org/r/893468 (https://phabricator.wikimedia.org/T330849) (owner: 10Jbond) [11:27:26] 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 9 others: eqiad row B switches 
upgrade - https://phabricator.wikimedia.org/T330165 (10Marostegui) [11:27:28] !log elukey@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [11:27:41] !log elukey@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [11:27:48] !log otto@deploy2002 Started deploy [analytics/refinery@d4aaff9]: Regular analytics weekly train [analytics/refinery@d4aaff9] [11:28:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P45464 and previous config saved to /var/cache/conftool/dbconfig/20230308-112850-marostegui.json [11:29:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P45465 and previous config saved to /var/cache/conftool/dbconfig/20230308-112901-marostegui.json [11:29:24] !log elukey@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [11:29:47] !log elukey@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[11:31:08] Traffic is picking up nicely in eqiad [11:31:10] (03PS41) 10Btullis: Add a spark-operator chart and helmfile configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) [11:31:35] RECOVERY - Check whether ferm is active by checking the default input chain on kubemaster1002 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:31:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P45466 and previous config saved to /var/cache/conftool/dbconfig/20230308-113136-marostegui.json [11:31:42] Forgot to heads up moritzm and marostegui, I repooled eqiad for traffic [11:32:15] RECOVERY - Check systemd state on kubemaster1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:32:44] (03CR) 10Elukey: [C: 03+1] sre.k8s: fix issues reported by mypy [cookbooks] - 10https://gerrit.wikimedia.org/r/895208 (owner: 10Volans) [11:32:59] claime: traffic is already ramping up in eqiad [11:33:23] 12k rps and increasing [11:33:48] ack [11:33:51] yeah we're almost up to 1Gb/s [11:34:56] (03PS8) 10Jbond: profile::confd: add a confd profile [puppet] - 10https://gerrit.wikimedia.org/r/893468 (https://phabricator.wikimedia.org/T330849) [11:35:52] (03PS6) 10Hnowlan: Add service records for device-analytics. [dns] - 10https://gerrit.wikimedia.org/r/890398 (https://phabricator.wikimedia.org/T320967) [11:36:30] (03CR) 10Jbond: "updated thanks" [puppet] - 10https://gerrit.wikimedia.org/r/893468 (https://phabricator.wikimedia.org/T330849) (owner: 10Jbond) [11:36:32] (03CR) 10Clément Goubert: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/890398 (https://phabricator.wikimedia.org/T320967) (owner: 10Hnowlan) [11:36:39] (03CR) 10CI reject: [V: 04-1] Add service records for device-analytics. 
[dns] - https://gerrit.wikimedia.org/r/890398 (https://phabricator.wikimedia.org/T320967) (owner: Hnowlan) [11:36:41] (CR) Clément Goubert: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/895716 (https://phabricator.wikimedia.org/T320967) (owner: Hnowlan) [11:37:04] (CR) Jbond: [C: +2] SREBatchBase: Make action method a bit more dynamic [cookbooks] - https://gerrit.wikimedia.org/r/806170 (owner: Jbond) [11:37:15] (CR) CI reject: [V: -1] SREBatchBase: Make action method a bit more dynamic [cookbooks] - https://gerrit.wikimedia.org/r/806170 (owner: Jbond) [11:37:27] !log otto@deploy2002 deploy aborted: Regular analytics weekly train [analytics/refinery@d4aaff9] (duration: 09m 38s) [11:37:29] !log otto@deploy2002 Started deploy [analytics/refinery@d4aaff9]: Regular analytics weekly train [analytics/refinery@d4aaff9] [11:39:10] vgutierrez: looks to be stabilizing [11:39:40] (PS10) Jbond: SREBatchBase: Make action method a bit more dynamic [cookbooks] - https://gerrit.wikimedia.org/r/806170 [11:39:49] (CR) Jbond: [C: +2] SREBatchBase: Make action method a bit more dynamic [cookbooks] - https://gerrit.wikimedia.org/r/806170 (owner: Jbond) [11:41:37] (Merged) jenkins-bot: SREBatchBase: Make action method a bit more dynamic [cookbooks] - https://gerrit.wikimedia.org/r/806170 (owner: Jbond) [11:41:50] (CR) Btullis: Add a spark-operator chart and helmfile configuration (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: Btullis) [11:42:39] !log otto@deploy2002 Finished deploy [analytics/refinery@d4aaff9]: Regular analytics weekly train [analytics/refinery@d4aaff9] (duration: 05m 09s) [11:42:46] (CR) Clément Goubert: [C: +1] helmfile: add device-analytics configuration [deployment-charts] - https://gerrit.wikimedia.org/r/886358 (https://phabricator.wikimedia.org/T320967) (owner: Hnowlan) [11:43:11] (CR) Clément 
Goubert: [C: +1] service, k8s: add service configuration for AQS2 service device-analytics [puppet] - https://gerrit.wikimedia.org/r/889960 (https://phabricator.wikimedia.org/T320967) (owner: Hnowlan) [11:43:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P45467 and previous config saved to /var/cache/conftool/dbconfig/20230308-114357-marostegui.json [11:44:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P45468 and previous config saved to /var/cache/conftool/dbconfig/20230308-114407-marostegui.json [11:44:35] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc1039.eqiad.wmnet with OS bullseye [11:44:40] vgutierrez: everything looks good to me wrt to eqiad repool, moving on to restbase-async switchback [11:44:44] akosiaris: ^ [11:45:17] RECOVERY - Hadoop HDFS Namenode FSImage Age on an-master1002 is OK: FILE_AGE OK: /srv/hadoop/name/current/VERSION is 90 seconds old and 217 bytes https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [11:45:28] claime: ack [11:45:30] SRE, Traffic, serviceops, Datacenter-Switchover: March 2023 Traffic Repool checklist - https://phabricator.wikimedia.org/T331285 (Clement_Goubert) [11:45:40] SRE, Traffic, serviceops, Datacenter-Switchover: March 2023 Traffic Repool checklist - https://phabricator.wikimedia.org/T331285 (Clement_Goubert) Open→Resolved [11:45:45] SRE, Data-Persistence, serviceops, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Clement_Goubert) [11:45:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1113:3315', diff saved to https://phabricator.wikimedia.org/P45469 and previous config saved to /var/cache/conftool/dbconfig/20230308-114553-root.json [11:46:42] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T329203)', diff saved to https://phabricator.wikimedia.org/P45470 and previous config saved to /var/cache/conftool/dbconfig/20230308-114642-marostegui.json [11:46:44] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2137.codfw.wmnet with reason: Maintenance [11:46:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2137.codfw.wmnet with reason: Maintenance [11:46:47] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [11:46:51] (CR) Volans: [C: +2] sre.k8s: fix issues reported by mypy [cookbooks] - https://gerrit.wikimedia.org/r/895208 (owner: Volans) [11:46:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2137:3314 (T329203)', diff saved to https://phabricator.wikimedia.org/P45471 and previous config saved to /var/cache/conftool/dbconfig/20230308-114652-marostegui.json [11:46:55] SRE, serviceops, Datacenter-Switchover: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 (Clement_Goubert) Resolved→In progress [11:47:02] SRE, Data-Persistence, serviceops, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Clement_Goubert) [11:47:14] !log otto@deploy2002 Started deploy [analytics/refinery@d4aaff9] (thin): Regular analytics weekly train THIN [analytics/refinery@d4aaff9] [11:47:22] !log otto@deploy2002 Finished deploy [analytics/refinery@d4aaff9] (thin): Regular analytics weekly train THIN [analytics/refinery@d4aaff9] (duration: 00m 07s) [11:47:32] !log otto@deploy2002 Started deploy [analytics/refinery@d4aaff9] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@d4aaff9] [11:47:53] SRE, serviceops, Datacenter-Switchover: 28 February 2023 Service 
Switchover checklist - https://phabricator.wikimedia.org/T330651 (Clement_Goubert) {T331285} done, switching `restbase-async` back to its standard state. [11:48:39] (Merged) jenkins-bot: sre.k8s: fix issues reported by mypy [cookbooks] - https://gerrit.wikimedia.org/r/895208 (owner: Volans) [11:48:46] !log Starting restbase-async switchback - T330651 [11:48:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:51] T330651: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 [11:48:53] Heads up marostegui moritzm ^ [11:49:03] !log otto@deploy2002 Finished deploy [analytics/refinery@d4aaff9] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@d4aaff9] (duration: 01m 30s) [11:49:03] SRE, Infrastructure-Foundations, netops, cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (cmooney) [11:49:07] claime: thanks [11:49:14] SRE, ops-codfw, Infrastructure-Foundations, netops: Run 2x1G links from asw-b1-codfw to cloudsw1-b1-codfw - https://phabricator.wikimedia.org/T331470 (cmooney) Resolved→Open @JHancock.wm my apologies errors abound on this one. I just realised that on the QFX5120 platform we can't mix and... 
[11:49:21] !log cgoubert@cumin1001 START - Cookbook sre.discovery.service-route pool restbase-async in eqiad: T330651 [11:49:22] !log cgoubert@cumin1001 START - Cookbook sre.dns.wipe-cache restbase-async.discovery.wmnet on all recursors [11:49:26] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) restbase-async.discovery.wmnet on all recursors [11:49:29] (03CR) 10Volans: [C: 03+2] sre.mysql.upgrade: remove wrong line [cookbooks] - 10https://gerrit.wikimedia.org/r/895209 (owner: 10Volans) [11:49:41] 5 minutes mandatory wait [11:49:57] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for s-mukuti - https://phabricator.wikimedia.org/T331402 (10S_Mukuti) Yes "Sarah Mukuti" is the correct username. [11:50:27] (03CR) 10Volans: [C: 03+2] sre.mediawiki.route-traffic: fix wrong call [cookbooks] - 10https://gerrit.wikimedia.org/r/895210 (owner: 10Volans) [11:51:10] (03Merged) 10jenkins-bot: sre.mysql.upgrade: remove wrong line [cookbooks] - 10https://gerrit.wikimedia.org/r/895209 (owner: 10Volans) [11:51:27] 10SRE, 10serviceops, 10Datacenter-Switchover: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 (10Clement_Goubert) [11:52:14] (03Merged) 10jenkins-bot: sre.mediawiki.route-traffic: fix wrong call [cookbooks] - 10https://gerrit.wikimedia.org/r/895210 (owner: 10Volans) [11:52:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1113:3315', diff saved to https://phabricator.wikimedia.org/P45472 and previous config saved to /var/cache/conftool/dbconfig/20230308-115252-root.json [11:53:53] 10SRE, 10serviceops, 10Datacenter-Switchover: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 (10Clement_Goubert) [11:53:57] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: March 2023 Datacenter Switchover eqiad pooling schedule - https://phabricator.wikimedia.org/T328903 (10Clement_Goubert) [11:54:07] PROBLEM - MegaRAID on an-worker1078 is 
CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:54:25] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) pool restbase-async in eqiad: T330651 [11:54:29] T330651: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 [11:54:35] 10SRE, 10Traffic, 10serviceops, 10Datacenter-Switchover: March 2023 Traffic Repool checklist - https://phabricator.wikimedia.org/T331285 (10Clement_Goubert) [11:54:41] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: March 2023 Datacenter Switchover eqiad pooling schedule - https://phabricator.wikimedia.org/T328903 (10Clement_Goubert) [11:54:51] (03PS7) 10Hnowlan: Add service records for device-analytics. [dns] - 10https://gerrit.wikimedia.org/r/890398 (https://phabricator.wikimedia.org/T320967) [11:54:59] !log restbase-async pooled in eqiad, depooling in codfw- T330651 [11:55:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:12] !log cgoubert@cumin1001 START - Cookbook sre.discovery.service-route depool restbase-async in codfw: T330651 [11:55:13] !log cgoubert@cumin1001 START - Cookbook sre.dns.wipe-cache restbase-async.discovery.wmnet on all recursors [11:55:16] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) restbase-async.discovery.wmnet on all recursors [11:55:30] (03CR) 10CI reject: [V: 04-1] Add service records for device-analytics. 
[dns] - 10https://gerrit.wikimedia.org/r/890398 (https://phabricator.wikimedia.org/T320967) (owner: 10Hnowlan) [11:56:06] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: March 2023 Datacenter Switchover eqiad pooling schedule - https://phabricator.wikimedia.org/T328903 (10Clement_Goubert) [11:57:28] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1039.eqiad.wmnet with reason: host reimage [11:57:29] 10SRE, 10serviceops, 10Patch-For-Review: kubernetes102[34] implemetation tracking - https://phabricator.wikimedia.org/T313874 (10akosiaris) [11:57:34] (03CR) 10Clément Goubert: "Forgot to mention you need to add them to utils/mock_etc/discovery-geo-resources" [dns] - 10https://gerrit.wikimedia.org/r/890398 (https://phabricator.wikimedia.org/T320967) (owner: 10Hnowlan) [11:58:04] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes102[34] - https://phabricator.wikimedia.org/T313873 (10akosiaris) [11:58:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314 (T329203)', diff saved to https://phabricator.wikimedia.org/P45473 and previous config saved to /var/cache/conftool/dbconfig/20230308-115815-marostegui.json [11:58:18] (03PS8) 10Hnowlan: Add service records for device-analytics. 
[dns] - https://gerrit.wikimedia.org/r/890398 (https://phabricator.wikimedia.org/T320967) [11:58:20] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [11:58:46] (CR) Ottomata: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40008/console" [puppet] - https://gerrit.wikimedia.org/r/895135 (https://phabricator.wikimedia.org/T331345) (owner: Ottomata) [11:58:47] SRE, serviceops, Patch-For-Review: kubernetes102[34] implemetation tracking - https://phabricator.wikimedia.org/T313874 (akosiaris) Open→Resolved Nodes added, resolving. Many thanks @jijiki [11:59:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T328817)', diff saved to https://phabricator.wikimedia.org/P45474 and previous config saved to /var/cache/conftool/dbconfig/20230308-115903-marostegui.json [11:59:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2146.codfw.wmnet with reason: Maintenance [11:59:08] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [11:59:08] (CR) Ottomata: [V: +1 C: +2] Install spark3 and conda-analytics on all analytics cluster airflow nodes [puppet] - https://gerrit.wikimedia.org/r/895135 (https://phabricator.wikimedia.org/T331345) (owner: Ottomata) [11:59:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T329260)', diff saved to https://phabricator.wikimedia.org/P45475 and previous config saved to /var/cache/conftool/dbconfig/20230308-115913-marostegui.json [11:59:16] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2157.codfw.wmnet with reason: Maintenance [11:59:18] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2146.codfw.wmnet with 
reason: Maintenance [11:59:19] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [11:59:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2146 (T328817)', diff saved to https://phabricator.wikimedia.org/P45476 and previous config saved to /var/cache/conftool/dbconfig/20230308-115924-marostegui.json [11:59:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2157.codfw.wmnet with reason: Maintenance [11:59:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2157 (T329260)', diff saved to https://phabricator.wikimedia.org/P45477 and previous config saved to /var/cache/conftool/dbconfig/20230308-115935-marostegui.json [12:00:15] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) depool restbase-async in codfw: T330651 [12:00:19] T330651: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 [12:01:07] (PS1) Jbond: do not merger: example of phabricator pcc run [puppet] - https://gerrit.wikimedia.org/r/895722 [12:01:43] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1039.eqiad.wmnet with reason: host reimage [12:01:46] !log restbase-async back in standard state - T330651 [12:01:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:14] SRE, Data-Persistence, serviceops, Datacenter-Switchover: March 2023 Datacenter Switchover eqiad pooling schedule - https://phabricator.wikimedia.org/T328903 (Clement_Goubert) [12:02:23] SRE, Data-Persistence, serviceops, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Clement_Goubert) [12:02:31] SRE, serviceops, Datacenter-Switchover: 28 February 2023 Service Switchover checklist - https://phabricator.wikimedia.org/T330651 (Clement_Goubert) In progress→Resolved 
[12:02:35] (CR) Jbond: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40009/console" [puppet] - https://gerrit.wikimedia.org/r/895722 (owner: Jbond) [12:02:53] (CR) Hnowlan: [C: +2] Add service records for device-analytics. [dns] - https://gerrit.wikimedia.org/r/890398 (https://phabricator.wikimedia.org/T320967) (owner: Hnowlan) [12:03:47] SRE, Data-Persistence, serviceops, Datacenter-Switchover: March 2023 Datacenter Switchover eqiad pooling schedule - https://phabricator.wikimedia.org/T328903 (Clement_Goubert) Open→Resolved We are now out of the window of eqiad complete depool, according to schedule. [12:03:55] SRE, Data-Persistence, serviceops, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Clement_Goubert) [12:04:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T329260)', diff saved to https://phabricator.wikimedia.org/P45478 and previous config saved to /var/cache/conftool/dbconfig/20230308-120406-marostegui.json [12:04:49] (PS2) Jbond: do not merger: example of phabricator pcc run [puppet] - https://gerrit.wikimedia.org/r/895722 [12:04:54] (PS1) MVernon: admin: add s-mukuti to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/895723 (https://phabricator.wikimedia.org/T331402) [12:05:48] (CR) Jbond: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40010/console" [puppet] - https://gerrit.wikimedia.org/r/895722 (owner: Jbond) [12:07:21] (PS3) Clément Goubert: trafficserver: move testwikidata to kubernetes [puppet] - https://gerrit.wikimedia.org/r/894529 (https://phabricator.wikimedia.org/T331268) [12:08:04] !log hnowlan@cumin1001 START - Cookbook sre.dns.netbox [12:08:50] (CR) Hnowlan: [C: +2] conftool-data: add device-analytics service [puppet] - 
10https://gerrit.wikimedia.org/r/895716 (https://phabricator.wikimedia.org/T320967) (owner: 10Hnowlan) [12:09:31] 10SRE-swift-storage, 10MediaWiki-File-management, 10Unstewarded-production-error: `Filebackend::Multiwrite`, multi-dc and thumbnail handling - https://phabricator.wikimedia.org/T331138 (10TheDJ) wrt to pre render. I'm assuming that is wgUploadThumbnailRenderMap at work. There are details on the various sizes... [12:09:34] 10SRE-swift-storage, 10Commons, 10Wikimedia-production-error: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10Yann) Now `00297: FAILED: stashfailed: An unknown error occurred in storage backend "local-swift-eqiad".... [12:09:44] !log jmm@cumin2002 END (ERROR) - Cookbook sre.ganeti.reimage (exit_code=97) for host urldownloader1003.wikimedia.org with OS bullseye [12:09:49] 10SRE, 10Infrastructure-Foundations: Migrate the URL downloaders to Bullseye - https://phabricator.wikimedia.org/T329945 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by jmm@cumin2002 for host urldownloader1003.wikimedia.org with OS bullseye executed with errors: - urldownloader1003 (**F... 
[12:10:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T328817)', diff saved to https://phabricator.wikimedia.org/P45479 and previous config saved to /var/cache/conftool/dbconfig/20230308-121009-marostegui.json [12:10:14] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [12:10:36] !log hnowlan@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add service records for device-analytics - hnowlan@cumin1001" [12:10:39] (03PS1) 10Cathal Mooney: Return port blocks data for both QFX5120-48Y Netbox device types [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/895725 (https://phabricator.wikimedia.org/T331519) [12:10:47] (03CR) 10CI reject: [V: 04-1] Return port blocks data for both QFX5120-48Y Netbox device types [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/895725 (https://phabricator.wikimedia.org/T331519) (owner: 10Cathal Mooney) [12:10:59] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops: Insert a header for specific domains at haproxy layer to redirect traffic to mw-on-k8s - https://phabricator.wikimedia.org/T331318 (10Clement_Goubert) [12:13:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314', diff saved to https://phabricator.wikimedia.org/P45480 and previous config saved to /var/cache/conftool/dbconfig/20230308-121321-marostegui.json [12:13:28] (03PS2) 10Cathal Mooney: Return port blocks data for both QFX5120-48Y Netbox device types [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/895725 (https://phabricator.wikimedia.org/T331519) [12:14:14] !log jmm@cumin2002 START - Cookbook sre.ganeti.reimage for host urldownloader1003.wikimedia.org with OS bullseye [12:14:19] 10SRE, 10Infrastructure-Foundations: Migrate the URL downloaders to Bullseye - https://phabricator.wikimedia.org/T329945 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.ganeti.reimage was started by jmm@cumin2002 for host urldownloader1003.wikimedia.org with OS bullseye [12:15:36] (PS1) Muehlenhoff: Extend insetup alias to also include serviceops_collab [puppet] - https://gerrit.wikimedia.org/r/895746 [12:16:43] (PS1) Slyngshede: C:idm::deployment fix LDAP configuration [puppet] - https://gerrit.wikimedia.org/r/895747 [12:17:50] (CR) Slyngshede: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40011/console" [puppet] - https://gerrit.wikimedia.org/r/895747 (owner: Slyngshede) [12:17:59] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1039.eqiad.wmnet with OS bullseye [12:19:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P45482 and previous config saved to /var/cache/conftool/dbconfig/20230308-121912-marostegui.json [12:20:22] (PS1) Alexandros Kosiaris: istio wikikube: Add the proper tolerations [deployment-charts] - https://gerrit.wikimedia.org/r/895748 [12:21:13] (PS1) Ottomata: mediawiki-page-content-change-enrichment - bump to 1.1.0 [deployment-charts] - https://gerrit.wikimedia.org/r/895749 (https://phabricator.wikimedia.org/T330994) [12:21:21] (CR) Vgutierrez: "this one could be added to the existing test2.wp.o map rules leveraging regex_map instead of duplicating the whole thing" [puppet] - https://gerrit.wikimedia.org/r/894529 (https://phabricator.wikimedia.org/T331268) (owner: Clément Goubert) [12:21:33] (PS2) Slyngshede: C:idm::deployment fix LDAP configuration [puppet] - https://gerrit.wikimedia.org/r/895747 [12:22:27] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add service records for device-analytics - hnowlan@cumin1001" [12:22:27] !log hnowlan@cumin1001 END (PASS) - 
Cookbook sre.dns.netbox (exit_code=0) [12:22:35] (CR) Slyngshede: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40012/console" [puppet] - https://gerrit.wikimedia.org/r/895747 (owner: Slyngshede) [12:22:52] (CR) Muehlenhoff: [C: +2] Extend insetup alias to also include serviceops_collab [puppet] - https://gerrit.wikimedia.org/r/895746 (owner: Muehlenhoff) [12:24:24] (CR) Slyngshede: [V: +1] "Maybe someone can explain why lookup('ldap', Hash, hash, {}), pulls ldap configurations for commons, while lookup('ldap') pulls for eqiad " [puppet] - https://gerrit.wikimedia.org/r/895747 (owner: Slyngshede) [12:24:32] SRE-swift-storage, Commons, Wikimedia-production-error: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (MatthewVernon) I see two successful PUTs for that object (one per DC), and indeed it seems to be success... 
[12:25:14] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/895723 (https://phabricator.wikimedia.org/T331402) (owner: MVernon) [12:25:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P45483 and previous config saved to /var/cache/conftool/dbconfig/20230308-122515-marostegui.json [12:25:57] (CR) MVernon: [C: +2] admin: add s-mukuti to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/895723 (https://phabricator.wikimedia.org/T331402) (owner: MVernon) [12:26:18] (CR) Ottomata: [C: +2] mediawiki-page-content-change-enrichment - bump to 1.1.0 [deployment-charts] - https://gerrit.wikimedia.org/r/895749 (https://phabricator.wikimedia.org/T330994) (owner: Ottomata) [12:28:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314', diff saved to https://phabricator.wikimedia.org/P45484 and previous config saved to /var/cache/conftool/dbconfig/20230308-122827-marostegui.json [12:28:43] SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to ldap/wmf for s-mukuti - https://phabricator.wikimedia.org/T331402 (MatthewVernon) Open→Resolved a:MatthewVernon @S_Mukuti this is all done for you now. 
[12:30:00] SRE, Observability-Metrics, User-CDanis: Better organization for SRE grafana dashboards - https://phabricator.wikimedia.org/T178690 (fgiunchedi) [12:30:55] (Merged) jenkins-bot: mediawiki-page-content-change-enrichment - bump to 1.1.0 [deployment-charts] - https://gerrit.wikimedia.org/r/895749 (https://phabricator.wikimedia.org/T330994) (owner: Ottomata) [12:31:06] !log running authdns-update for r/890398 [12:31:07] (PS1) Muehlenhoff: Extend dumps alias [puppet] - https://gerrit.wikimedia.org/r/895751 [12:31:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:17] SRE, MW-on-K8s, Traffic, serviceops: Insert a header for specific domains at haproxy layer to redirect traffic to mw-on-k8s - https://phabricator.wikimedia.org/T331318 (Vgutierrez) We traditionally perform that kind of header mangling in varnish rather than on the TLS termination layer as we try... [12:34:10] (PS6) Hnowlan: service, k8s: add service configuration for AQS2 service device-analytics [puppet] - https://gerrit.wikimedia.org/r/889960 (https://phabricator.wikimedia.org/T320967) [12:34:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P45485 and previous config saved to /var/cache/conftool/dbconfig/20230308-123418-marostegui.json [12:36:56] (CR) JMeybohm: "Out of curiosity: Did an actual problem surfaced out of this? I was under the impression that this is not required. Kask does not use ingr" [deployment-charts] - https://gerrit.wikimedia.org/r/895748 (owner: Alexandros Kosiaris) [12:39:39] (PS1) Slyngshede: SUL linking: User may noget have the correct attribute. 
[software/bitu] - https://gerrit.wikimedia.org/r/895753 [12:40:08] SRE, Infrastructure-Foundations, Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (jijiki) [12:40:12] SRE, MW-on-K8s, Traffic, serviceops: Find a sensible way to redirect traffic to mw-on-k8s - https://phabricator.wikimedia.org/T331318 (Clement_Goubert) [12:40:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P45486 and previous config saved to /var/cache/conftool/dbconfig/20230308-124021-marostegui.json [12:40:55] SRE, MW-on-K8s, Traffic, serviceops: Find a sensible way to redirect traffic to mw-on-k8s - https://phabricator.wikimedia.org/T331318 (Clement_Goubert) Changed the task title to reflect the direction of the discussion. [12:42:36] SRE, Infrastructure-Foundations, netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (cmooney) [12:43:16] (PS1) Ottomata: mediawiki-page-content-change-enrichment - use kafka 9092 until we can use TLS [deployment-charts] - https://gerrit.wikimedia.org/r/895754 (https://phabricator.wikimedia.org/T331526) [12:43:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314 (T329203)', diff saved to https://phabricator.wikimedia.org/P45487 and previous config saved to /var/cache/conftool/dbconfig/20230308-124334-marostegui.json [12:43:36] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2138.codfw.wmnet with reason: Maintenance [12:43:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2138.codfw.wmnet with reason: Maintenance [12:43:39] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [12:43:44] !log marostegui@cumin1001 dbctl commit 
(dc=all): 'Depooling db2138:3314 (T329203)', diff saved to https://phabricator.wikimedia.org/P45488 and previous config saved to /var/cache/conftool/dbconfig/20230308-124344-marostegui.json [12:44:15] (CR) JMeybohm: [C: +1] "Thanks!" [deployment-charts] - https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: Btullis) [12:45:16] (PS4) Clément Goubert: trafficserver: move testwikidata to kubernetes [puppet] - https://gerrit.wikimedia.org/r/894529 (https://phabricator.wikimedia.org/T331268) [12:46:26] (CR) Ottomata: [V: +2 C: +2] mediawiki-page-content-change-enrichment - use kafka 9092 until we can use TLS [deployment-charts] - https://gerrit.wikimedia.org/r/895754 (https://phabricator.wikimedia.org/T331526) (owner: Ottomata) [12:47:39] (CR) Clément Goubert: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40014/console" [puppet] - https://gerrit.wikimedia.org/r/894529 (https://phabricator.wikimedia.org/T331268) (owner: Clément Goubert) [12:48:42] !log otto@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [12:48:47] !log otto@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [12:49:19] (CR) EoghanGaffney: [V: +1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40015/console" [puppet] - https://gerrit.wikimedia.org/r/895240 (https://phabricator.wikimedia.org/T322369) (owner: EoghanGaffney) [12:49:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T329260)', diff saved to https://phabricator.wikimedia.org/P45489 and previous config saved to /var/cache/conftool/dbconfig/20230308-124924-marostegui.json [12:49:26] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2171.codfw.wmnet with 
reason: Maintenance [12:49:29] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [12:49:39] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2171.codfw.wmnet with reason: Maintenance [12:49:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3315 (T329260)', diff saved to https://phabricator.wikimedia.org/P45490 and previous config saved to /var/cache/conftool/dbconfig/20230308-124945-marostegui.json [12:50:34] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [12:51:57] (CR) Clément Goubert: trafficserver: move testwikidata to kubernetes (1 comment) [puppet] - https://gerrit.wikimedia.org/r/894529 (https://phabricator.wikimedia.org/T331268) (owner: Clément Goubert) [12:52:26] (CR) Clément Goubert: [V: +1] "Re-adding +1 from PCC success" [puppet] - https://gerrit.wikimedia.org/r/894529 (https://phabricator.wikimedia.org/T331268) (owner: Clément Goubert) [12:53:19] (CR) Muehlenhoff: [C: +1] "Looks good, two typos inline" [software/bitu] - https://gerrit.wikimedia.org/r/895753 (owner: Slyngshede) [12:54:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T329260)', diff saved to https://phabricator.wikimedia.org/P45491 and previous config saved to /var/cache/conftool/dbconfig/20230308-125422-marostegui.json [12:55:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314 (T329203)', diff saved to https://phabricator.wikimedia.org/P45492 and previous config saved to /var/cache/conftool/dbconfig/20230308-125515-marostegui.json [12:55:21] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [12:55:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T328817)', diff saved to 
https://phabricator.wikimedia.org/P45493 and previous config saved to /var/cache/conftool/dbconfig/20230308-125527-marostegui.json [12:55:29] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2153.codfw.wmnet with reason: Maintenance [12:55:33] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [12:55:43] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2153.codfw.wmnet with reason: Maintenance [12:55:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2153 (T328817)', diff saved to https://phabricator.wikimedia.org/P45494 and previous config saved to /var/cache/conftool/dbconfig/20230308-125548-marostegui.json [13:00:43] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [13:00:58] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:02:06] (PS2) Slyngshede: SUL linking: LDAP user object may no have the correct attribute. [software/bitu] - https://gerrit.wikimedia.org/r/895753 [13:02:10] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [13:02:17] (CR) Slyngshede: SUL linking: LDAP user object may no have the correct attribute. 
(2 comments) [software/bitu] - https://gerrit.wikimedia.org/r/895753 (owner: Slyngshede) [13:02:17] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [13:05:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:06:01] (CR) Muehlenhoff: SUL linking: LDAP user object may no have the correct attribute. [software/bitu] - https://gerrit.wikimedia.org/r/895753 (owner: Slyngshede) [13:06:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T328817)', diff saved to https://phabricator.wikimedia.org/P45495 and previous config saved to /var/cache/conftool/dbconfig/20230308-130613-marostegui.json [13:06:19] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [13:06:39] SRE, LDAP-Access-Requests: Request access to the group ldap/wmf - https://phabricator.wikimedia.org/T331370 (lwatson) Thanks for help! 
[13:08:51] (PS36) David Caro: Modify maintain-dbusers.py to call the rest-api service [puppet] - https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe) [13:08:53] (PS1) David Caro: maintain-dbusers: add nicer logging with dry run prefix [puppet] - https://gerrit.wikimedia.org/r/895756 [13:09:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P45496 and previous config saved to /var/cache/conftool/dbconfig/20230308-130928-marostegui.json [13:10:07] (PS1) Jbond: pki: Add blackbox tests for pki services [puppet] - https://gerrit.wikimedia.org/r/895757 [13:10:09] (PS1) Jbond: service:catalogue: Add pki as an active active service [puppet] - https://gerrit.wikimedia.org/r/895758 (https://phabricator.wikimedia.org/T331523) [13:10:10] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: sync [13:10:15] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: sync [13:10:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314', diff saved to https://phabricator.wikimedia.org/P45497 and previous config saved to /var/cache/conftool/dbconfig/20230308-131022-marostegui.json [13:10:26] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [13:10:29] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [13:11:14] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: sync [13:11:19] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: sync [13:11:30] (CR) CI reject: [V: -1] maintain-dbusers: add nicer logging with dry run prefix [puppet] - https://gerrit.wikimedia.org/r/895756 (owner: David Caro) [13:11:49] SRE-swift-storage, 
MediaWiki-File-management, Unstewarded-production-error: `Filebackend::Multiwrite`, multi-dc and thumbnail handling - https://phabricator.wikimedia.org/T331138 (Ladsgroup) >>! In T331138#8675706, @MatthewVernon wrote: >>>! In T331138#8675245, @Joe wrote: >> Also: if pre-generation... [13:13:15] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.reimage (exit_code=99) for host urldownloader1003.wikimedia.org with OS bullseye [13:13:19] SRE, Infrastructure-Foundations: Migrate the URL downloaders to Bullseye - https://phabricator.wikimedia.org/T329945 (ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by jmm@cumin2002 for host urldownloader1003.wikimedia.org with OS bullseye executed with errors: - urldownloader1003 (**F... [13:15:31] (PS37) Raymond Ndibe: Modify maintain-dbusers.py to call the rest-api service [puppet] - https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) [13:15:37] (PS2) Jbond: pki: Add blackbox tests for pki services [puppet] - https://gerrit.wikimedia.org/r/895757 [13:17:40] SRE-swift-storage, MediaWiki-File-management, Unstewarded-production-error: `Filebackend::Multiwrite`, multi-dc and thumbnail handling - https://phabricator.wikimedia.org/T331138 (Ladsgroup) My bad. Pre-genarated ones are different from user perf ones. Two sizes can be dropped from pre-gen sizes with... 
[13:18:29] !log jayme@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [13:18:32] !log jayme@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [13:21:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P45498 and previous config saved to /var/cache/conftool/dbconfig/20230308-132120-marostegui.json [13:22:50] (PS3) Jbond: pki: Add blackbox tests for pki services [puppet] - https://gerrit.wikimedia.org/r/895757 [13:22:52] (PS1) Jbond: wmflib: Add post to http methods [puppet] - https://gerrit.wikimedia.org/r/895762 [13:24:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P45499 and previous config saved to /var/cache/conftool/dbconfig/20230308-132434-marostegui.json [13:25:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314', diff saved to https://phabricator.wikimedia.org/P45500 and previous config saved to /var/cache/conftool/dbconfig/20230308-132528-marostegui.json [13:26:31] (PS4) Jbond: pki: Add blackbox tests for pki services [puppet] - https://gerrit.wikimedia.org/r/895757 [13:26:53] (PS1) Slyngshede: sre.ganeti.reimage: Force sync Netbox after reinstall. [cookbooks] - https://gerrit.wikimedia.org/r/895763 (https://phabricator.wikimedia.org/T331478) [13:28:24] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40019/console" [puppet] - https://gerrit.wikimedia.org/r/895757 (owner: Jbond) [13:28:36] (CR) CI reject: [V: -1] sre.ganeti.reimage: Force sync Netbox after reinstall. 
[cookbooks] - https://gerrit.wikimedia.org/r/895763 (https://phabricator.wikimedia.org/T331478) (owner: Slyngshede) [13:30:46] (PS2) Slyngshede: sre.ganeti.reimage: Force sync Netbox after reinstall. [cookbooks] - https://gerrit.wikimedia.org/r/895763 (https://phabricator.wikimedia.org/T331478) [13:31:13] (PS1) Muehlenhoff: Fix service name in auto restart [puppet] - https://gerrit.wikimedia.org/r/895764 [13:31:19] (PS1) JMeybohm: Revert: Remove the .Values.kubernetesApi hack [deployment-charts] - https://gerrit.wikimedia.org/r/895765 (https://phabricator.wikimedia.org/T326729) [13:32:20] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40020/console" [puppet] - https://gerrit.wikimedia.org/r/895757 (owner: Jbond) [13:32:38] (CR) CI reject: [V: -1] sre.ganeti.reimage: Force sync Netbox after reinstall. [cookbooks] - https://gerrit.wikimedia.org/r/895763 (https://phabricator.wikimedia.org/T331478) (owner: Slyngshede) [13:34:28] (PS3) Slyngshede: sre.ganeti.reimage: Force sync Netbox after reinstall. 
[cookbooks] - https://gerrit.wikimedia.org/r/895763 (https://phabricator.wikimedia.org/T331478) [13:36:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P45501 and previous config saved to /var/cache/conftool/dbconfig/20230308-133626-marostegui.json [13:36:59] (CR) JMeybohm: "For the record: This created a diff in which helm tried to add the configmap "flink-session-cluster-main-tls-proxy-certs" (Source: flink-s" [deployment-charts] - https://gerrit.wikimedia.org/r/895765 (https://phabricator.wikimedia.org/T326729) (owner: JMeybohm) [13:37:17] (PS2) Volans: sre.{idm,pdus,puppet}: fix mypy issues [cookbooks] - https://gerrit.wikimedia.org/r/895211 [13:37:19] (PS2) Volans: sre.loadbalancer.restart-pybal: fix mypy issues [cookbooks] - https://gerrit.wikimedia.org/r/895212 [13:37:21] (PS2) Volans: sre.discovery: fix mypy issues [cookbooks] - https://gerrit.wikimedia.org/r/895213 [13:37:23] (PS2) Volans: sre.wdqs.data-transfer: fix mypy issues [cookbooks] - https://gerrit.wikimedia.org/r/895214 [13:37:25] (PS2) Volans: sre.k8s.pool-depool-cluster: ignore mypy errors [cookbooks] - https://gerrit.wikimedia.org/r/895215 [13:37:27] (PS2) Volans: tox: add mypy testing [cookbooks] - https://gerrit.wikimedia.org/r/895216 [13:38:32] (CR) JMeybohm: [C: +2] Revert: Remove the .Values.kubernetesApi hack [deployment-charts] - https://gerrit.wikimedia.org/r/895765 (https://phabricator.wikimedia.org/T326729) (owner: JMeybohm) [13:39:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T329260)', diff saved to https://phabricator.wikimedia.org/P45502 and previous config saved to /var/cache/conftool/dbconfig/20230308-133940-marostegui.json [13:39:42] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2178.codfw.wmnet with reason: Maintenance [13:39:46] T329260: Drop 
cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [13:39:56] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2178.codfw.wmnet with reason: Maintenance [13:40:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2178 (T329260)', diff saved to https://phabricator.wikimedia.org/P45503 and previous config saved to /var/cache/conftool/dbconfig/20230308-134002-marostegui.json [13:40:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314 (T329203)', diff saved to https://phabricator.wikimedia.org/P45504 and previous config saved to /var/cache/conftool/dbconfig/20230308-134034-marostegui.json [13:40:36] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2139.codfw.wmnet with reason: Maintenance [13:40:39] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [13:40:49] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2139.codfw.wmnet with reason: Maintenance [13:41:05] SRE, ops-esams, DC-Ops: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (RobH) [13:43:20] (Merged) jenkins-bot: Revert: Remove the .Values.kubernetesApi hack [deployment-charts] - https://gerrit.wikimedia.org/r/895765 (https://phabricator.wikimedia.org/T326729) (owner: JMeybohm) [13:44:07] (PS1) Arturo Borrero Gonzalez: openstack: create openstack-ansible evaluation role [puppet] - https://gerrit.wikimedia.org/r/895789 (https://phabricator.wikimedia.org/T326758) [13:44:18] (CR) Volans: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/895763 (https://phabricator.wikimedia.org/T331478) (owner: Slyngshede) [13:45:06] (CR) Volans: [C: +2] "Just rebased resolving conflict, no changes." 
[cookbooks] - https://gerrit.wikimedia.org/r/895211 (owner: Volans) [13:45:18] (CR) Volans: [C: +2] sre.loadbalancer.restart-pybal: fix mypy issues [cookbooks] - https://gerrit.wikimedia.org/r/895212 (owner: Volans) [13:46:54] (Merged) jenkins-bot: sre.{idm,pdus,puppet}: fix mypy issues [cookbooks] - https://gerrit.wikimedia.org/r/895211 (owner: Volans) [13:46:56] (CR) Volans: [C: +1] "LGTM" [software/homer/deploy] - https://gerrit.wikimedia.org/r/895725 (https://phabricator.wikimedia.org/T331519) (owner: Cathal Mooney) [13:46:59] (PS2) Arturo Borrero Gonzalez: openstack: create openstack-ansible evaluation role [puppet] - https://gerrit.wikimedia.org/r/895789 (https://phabricator.wikimedia.org/T326758) [13:47:09] (Merged) jenkins-bot: sre.loadbalancer.restart-pybal: fix mypy issues [cookbooks] - https://gerrit.wikimedia.org/r/895212 (owner: Volans) [13:47:21] (CR) CI reject: [V: -1] openstack: create openstack-ansible evaluation role [puppet] - https://gerrit.wikimedia.org/r/895789 (https://phabricator.wikimedia.org/T326758) (owner: Arturo Borrero Gonzalez) [13:48:26] (PS3) Arturo Borrero Gonzalez: openstack: create openstack-ansible evaluation role [puppet] - https://gerrit.wikimedia.org/r/895789 (https://phabricator.wikimedia.org/T326758) [13:49:26] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2147.codfw.wmnet with reason: Maintenance [13:49:39] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2147.codfw.wmnet with reason: Maintenance [13:49:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2147 (T329203)', diff saved to https://phabricator.wikimedia.org/P45505 and previous config saved to /var/cache/conftool/dbconfig/20230308-134945-marostegui.json [13:49:50] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 
[13:50:25] (PS1) Volans: sre.idm.u2f: positional args don't have required [cookbooks] - https://gerrit.wikimedia.org/r/895790 [13:51:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T328817)', diff saved to https://phabricator.wikimedia.org/P45506 and previous config saved to /var/cache/conftool/dbconfig/20230308-135132-marostegui.json [13:51:34] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2167.codfw.wmnet with reason: Maintenance [13:51:38] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [13:51:47] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2167.codfw.wmnet with reason: Maintenance [13:51:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2167:3311 (T328817)', diff saved to https://phabricator.wikimedia.org/P45507 and previous config saved to /var/cache/conftool/dbconfig/20230308-135153-marostegui.json [13:54:29] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:55:07] (CR) Btullis: [C: +2] Add a spark-operator chart and helmfile configuration [deployment-charts] - https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: Btullis) [13:55:09] (CR) Nicolas Fraison: [C: +1] Add a spark-operator chart and helmfile configuration [deployment-charts] - https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: Btullis) [13:55:31] (CR) Volans: [C: +2] sre.discovery: fix mypy issues [cookbooks] - https://gerrit.wikimedia.org/r/895213 (owner: Volans) [13:55:51] (CR) Volans: [C: +2] sre.wdqs.data-transfer: fix mypy issues [cookbooks] - https://gerrit.wikimedia.org/r/895214 
(owner: Volans) [13:56:58] (PS1) Ottomata: Move wgEventStreams settings into ext-EventStreamConfig.php [mediawiki-config] - https://gerrit.wikimedia.org/r/895792 (https://phabricator.wikimedia.org/T308932) [13:57:25] (Merged) jenkins-bot: sre.discovery: fix mypy issues [cookbooks] - https://gerrit.wikimedia.org/r/895213 (owner: Volans) [13:57:34] (Merged) jenkins-bot: sre.wdqs.data-transfer: fix mypy issues [cookbooks] - https://gerrit.wikimedia.org/r/895214 (owner: Volans) [13:58:20] (PS4) Slyngshede: sre.ganeti.reimage: Force sync Netbox after reinstall. [cookbooks] - https://gerrit.wikimedia.org/r/895763 (https://phabricator.wikimedia.org/T331478) [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: My dear minions, it's time we take the moon! Just kidding. Time for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230308T1400). [14:00:05] Daimona, HouseOfM, and cmelo: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:08] I can deploy :) [14:00:09] (Merged) jenkins-bot: Add a spark-operator chart and helmfile configuration [deployment-charts] - https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: Btullis) [14:00:14] (CR) CI reject: [V: -1] sre.ganeti.reimage: Force sync Netbox after reinstall. [cookbooks] - https://gerrit.wikimedia.org/r/895763 (https://phabricator.wikimedia.org/T331478) (owner: Slyngshede) [14:00:38] o/ [14:00:54] (CR) Ottomata: "This moves the EventLogging configs back into the main file. 
wgEventStreams was the really long one, so hopefully this accomplishes the pu" [mediawiki-config] - https://gerrit.wikimedia.org/r/895792 (https://phabricator.wikimedia.org/T308932) (owner: Ottomata) [14:01:09] (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/895234 (https://phabricator.wikimedia.org/T327470) (owner: Daimona Eaytoy) [14:01:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T329203)', diff saved to https://phabricator.wikimedia.org/P45508 and previous config saved to /var/cache/conftool/dbconfig/20230308-140115-marostegui.json [14:01:21] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [14:01:28] Daimona: oh this is a beta only deploy :) [14:01:34] 0/ [14:01:44] (CR) Jelto: [V: +1] "This should be a mvp of running a registry on one of the WMCS Shared runners. 
TLS, credentials and global configuration of the proxy for o" [puppet] - https://gerrit.wikimedia.org/r/894100 (https://phabricator.wikimedia.org/T329679) (owner: Jelto) [14:01:53] (Merged) jenkins-bot: Enable $wgCampaignEventsEnableMultipleOrganizers on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/895234 (https://phabricator.wikimedia.org/T327470) (owner: Daimona Eaytoy) [14:02:22] Yup, hopefully an easy one :) [14:03:41] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1004 is OK: OK - maintain-dbusers is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:04:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T328817)', diff saved to https://phabricator.wikimedia.org/P45509 and previous config saved to /var/cache/conftool/dbconfig/20230308-140405-marostegui.json [14:04:10] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [14:05:04] (PS4) Arturo Borrero Gonzalez: openstack: create openstack-ansible evaluation role [puppet] - https://gerrit.wikimedia.org/r/895789 (https://phabricator.wikimedia.org/T326758) [14:06:01] Daimona: will be live on beta once https://integration.wikimedia.org/ci/view/Beta/job/beta-scap-sync-world/93507/console completes [14:06:06] (CR) Jbond: [C: +1] "lgtm" [cookbooks] - https://gerrit.wikimedia.org/r/895790 (owner: Volans) [14:06:14] Cool, ty [14:06:29] (CR) Jbond: [C: +2] wmflib: Add post to http methods [puppet] - https://gerrit.wikimedia.org/r/895762 (owner: Jbond) [14:07:00] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
[14:07:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T329260)', diff saved to https://phabricator.wikimedia.org/P45510 and previous config saved to /var/cache/conftool/dbconfig/20230308-140727-marostegui.json [14:07:32] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [14:07:37] (03PS5) 10Arturo Borrero Gonzalez: openstack: create openstack-ansible evaluation role [puppet] - 10https://gerrit.wikimedia.org/r/895789 (https://phabricator.wikimedia.org/T326758) [14:07:47] Uh, already there? [14:07:58] yup [14:08:12] (03CR) 10Volans: [C: 03+2] sre.idm.u2f: positional args don't have required [cookbooks] - 10https://gerrit.wikimedia.org/r/895790 (owner: 10Volans) [14:08:29] Testing [14:09:57] (03Merged) 10jenkins-bot: sre.idm.u2f: positional args don't have required [cookbooks] - 10https://gerrit.wikimedia.org/r/895790 (owner: 10Volans) [14:11:12] It seems to be working AFAICS [14:11:42] Amir1: any objections to https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/895792 ? [14:12:04] Daimona: woo \o/ [14:12:11] (03CR) 10Elukey: [C: 03+1] sre.k8s.pool-depool-cluster: ignore mypy errors [cookbooks] - 10https://gerrit.wikimedia.org/r/895215 (owner: 10Volans) [14:13:07] \o/ [14:13:41] Is there anything else we need to do? [14:13:43] Thanks @Diamona [14:13:45] (03CR) 10Volans: [C: 03+2] sre.k8s.pool-depool-cluster: ignore mypy errors [cookbooks] - 10https://gerrit.wikimedia.org/r/895215 (owner: 10Volans) [14:15:33] (03Merged) 10jenkins-bot: sre.k8s.pool-depool-cluster: ignore mypy errors [cookbooks] - 10https://gerrit.wikimedia.org/r/895215 (owner: 10Volans) [14:16:02] TheresNoTime: done with backport window? 
if so i'll deploy some config changes [14:16:09] Daimona: nope, that's all :) the deployment of config changes to the beta cluster is fairly lacklustre [14:16:17] ottomata: I am done, all yours [14:16:20] ty [14:16:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P45511 and previous config saved to /var/cache/conftool/dbconfig/20230308-141621-marostegui.json [14:16:27] !log close UTC afternoon backport window [14:16:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:30] Amazing, thank you TheresNoTime :) [14:16:48] (03CR) 10Ottomata: [C: 03+2] "I'd like to make some changes to wgEventStreams, and update documentation too. So I'm going to be bold and merge this. Feel free to reve" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895792 (https://phabricator.wikimedia.org/T308932) (owner: 10Ottomata) [14:17:31] (03Merged) 10jenkins-bot: Move wgEventStreams settings into ext-EventStreamConfig.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895792 (https://phabricator.wikimedia.org/T308932) (owner: 10Ottomata) [14:19:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P45513 and previous config saved to /var/cache/conftool/dbconfig/20230308-141911-marostegui.json [14:22:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P45514 and previous config saved to /var/cache/conftool/dbconfig/20230308-142233-marostegui.json [14:23:27] RECOVERY - Disk space on deploy1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy1002&var-datasource=eqiad+prometheus/ops [14:23:35] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: March 2023 Datacenter Switchover 
eqiad pooling schedule - https://phabricator.wikimedia.org/T328903 (10Clement_Goubert) 05Resolved→03In progress [14:23:44] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Clement_Goubert) [14:24:55] (03PS1) 10Ottomata: wgEventStreams - Declare rc1.enrichment.mediawiki_page_content_change.error stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895796 (https://phabricator.wikimedia.org/T326536) [14:24:59] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review: Ganeti reimage cookbook exception when running _clear_dhcp_cache - https://phabricator.wikimedia.org/T331478 (10Volans) >>! In T331478#8674505, @Volans wrote: > From a quick look the current data is correct and doesn't error out: > ` >>>> node.pr... [14:25:27] (03CR) 10Ottomata: "Deploying this to test out error event production. We can still bikeshed name." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895796 (https://phabricator.wikimedia.org/T326536) (owner: 10Ottomata) [14:25:29] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: March 2023 Datacenter Switchover eqiad pooling schedule - https://phabricator.wikimedia.org/T328903 (10Clement_Goubert) [14:25:32] !log bking@cumin2002 powering down elastic1060-66 for re-rack T322082 [14:25:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:36] T322082: Q1:rerack elastic10[53-67] - https://phabricator.wikimedia.org/T322082 [14:25:38] (03CR) 10Muehlenhoff: [C: 03+2] Fix service name in auto restart [puppet] - 10https://gerrit.wikimedia.org/r/895764 (owner: 10Muehlenhoff) [14:25:43] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review: Ganeti reimage cookbook exception when running _clear_dhcp_cache - https://phabricator.wikimedia.org/T331478 (10Volans) [14:26:13] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review: Ganeti reimage cookbook exception when running 
_clear_dhcp_cache - https://phabricator.wikimedia.org/T331478 (10ssingh) >>! In T331478#8676530, @Volans wrote: >>>! In T331478#8674505, @Volans wrote: >> From a quick look the current data is correc... [14:26:42] (03CR) 10Ottomata: [C: 03+2] wgEventStreams - Declare rc1.enrichment.mediawiki_page_content_change.error stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895796 (https://phabricator.wikimedia.org/T326536) (owner: 10Ottomata) [14:27:16] (03Abandoned) 10Slyngshede: sre.ganeti.reimage: Force sync Netbox after reinstall. [cookbooks] - 10https://gerrit.wikimedia.org/r/895763 (https://phabricator.wikimedia.org/T331478) (owner: 10Slyngshede) [14:27:28] (03Merged) 10jenkins-bot: wgEventStreams - Declare rc1.enrichment.mediawiki_page_content_change.error stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895796 (https://phabricator.wikimedia.org/T326536) (owner: 10Ottomata) [14:27:48] (03PS3) 10Volans: tox: add mypy testing [cookbooks] - 10https://gerrit.wikimedia.org/r/895216 [14:28:19] (03PS1) 10Btullis: Remove an invalid namespace definition from the spark-operator chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/895799 (https://phabricator.wikimedia.org/T318926) [14:28:25] (03PS1) 10Ottomata: mediawiki-page-content-change-enrichment - set error-destination [deployment-charts] - 10https://gerrit.wikimedia.org/r/895800 (https://phabricator.wikimedia.org/T326536) [14:28:38] (03CR) 10Slyngshede: [C: 03+2] SUL linking: LDAP user object may no have the correct attribute. [software/bitu] - 10https://gerrit.wikimedia.org/r/895753 (owner: 10Slyngshede) [14:28:40] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] SUL linking: LDAP user object may no have the correct attribute. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/895753 (owner: 10Slyngshede) [14:29:09] (03PS3) 10Slyngshede: C:idm::deployment fix LDAP configuration [puppet] - 10https://gerrit.wikimedia.org/r/895747 [14:29:26] (03CR) 10Nicolas Fraison: [C: 03+1] Remove an invalid namespace definition from the spark-operator chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/895799 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [14:29:37] RECOVERY - Check systemd state on arclamp1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:30:05] (03CR) 10Stevemunene: [C: 03+1] Remove an invalid namespace definition from the spark-operator chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/895799 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [14:30:49] TheresNoTime: q for you because you are here. I need to do a config change that touches multiple files. https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/895792 and https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/895796 [14:30:56] can I use scap-sync file with multiple files? [14:31:27] RECOVERY - Check systemd state on arclamp2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:31:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P45516 and previous config saved to /var/cache/conftool/dbconfig/20230308-143127-marostegui.json [14:32:02] I guess i can sync the full wmf-config directory? 
[14:32:05] ottomata: looking, but it really depends if the changes are dependent on each other (there's no guarantee on the order files are sync'd) [14:32:10] ottomata: you have to know the sync order, I'd actually abandon that patch and do it in multiple steps otherwise you'll cause a full outage [14:32:26] !log elukey@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [14:32:34] (that's why it would have been better to be reviewed before merging) [14:32:40] (defer to Amir/1) [14:32:43] Amir1: okay. how did you do yours? [14:32:44] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [14:32:56] (03CR) 10Volans: [C: 03+2] tox: add mypy testing [cookbooks] - 10https://gerrit.wikimedia.org/r/895216 (owner: 10Volans) [14:32:57] !log elukey@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [14:33:09] yours was one patch? [14:33:23] I didn't rename anything [14:33:32] anyway [14:33:41] but, you have multiple files to deploy to make it work [14:33:41] !log elukey@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [14:33:57] yes, for that, if you look in the ticket, I do three syncs [14:34:03] !log elukey@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [14:34:07] but again that's for when you don't rename the file [14:34:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P45517 and previous config saved to /var/cache/conftool/dbconfig/20230308-143418-marostegui.json [14:34:21] !log elukey@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[14:34:22] it should have been one patch to rename the file, and one patch to move things around [14:34:49] (03CR) 10Muehlenhoff: C:idm::deployment fix LDAP configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/895747 (owner: 10Slyngshede) [14:34:56] (03CR) 10Btullis: [C: 03+2] Remove an invalid namespace definition from the spark-operator chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/895799 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [14:34:57] !log elukey@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [14:34:58] (03Merged) 10jenkins-bot: tox: add mypy testing [cookbooks] - 10https://gerrit.wikimedia.org/r/895216 (owner: 10Volans) [14:35:02] looking... [14:35:05] e.g. https://phabricator.wikimedia.org/T308932#8601067 [14:35:06] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] P:idm move OIDC endpoint to variable. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/891242 (owner: 10Slyngshede) [14:35:17] !log elukey@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [14:35:39] !log elukey@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [14:35:55] Amir1: between when you synced MWConfigCacheGenerator and InitialiseSettings.php, did it just work okay because the duplicate configs were merged? [14:36:07] yes [14:36:11] got it. [14:36:27] RECOVERY - MegaRAID on an-worker1078 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [14:36:53] !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [14:37:09] need to deploy a lot of things, apologies for the spam [14:37:31] !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . 
[14:37:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P45518 and previous config saved to /var/cache/conftool/dbconfig/20230308-143739-marostegui.json [14:37:52] okay got it, so revert these changes, make a new ext-EventStreamConfig.php but also keep ext-EventLogging.php as is [14:37:57] !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [14:38:20] 10SRE-tools, 10Infrastructure-Foundations: Ganeti reimage cookbook exception when running _clear_dhcp_cache - https://phabricator.wikimedia.org/T331478 (10Volans) We do have an understanding of the issue, we're discussing how to fix it. It's basically inconsistent data in netbox. [14:38:29] and do the remove of ext-EventStreamConfig.php last. (can I sync-file a remove?) or just ignore the fact that it was removed, since MWConfigCacheGenerator will ignore it after that? [14:38:42] !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [14:39:09] (03PS1) 10Ottomata: Revert "Move wgEventStreams settings into ext-EventStreamConfig.php" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895769 [14:39:17] (03CR) 10CI reject: [V: 04-1] Revert "Move wgEventStreams settings into ext-EventStreamConfig.php" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895769 (owner: 10Ottomata) [14:39:22] !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [14:39:29] "since MWConfigCacheGenerator will ignore it after that" it'll cause outage [14:39:43] !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
[14:39:47] (03Merged) 10jenkins-bot: Remove an invalid namespace definition from the spark-operator chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/895799 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [14:40:08] Amir1: i think i need two patches, one with all the files in place and configs duplicated. sync those. then one with removal of extraneous files and configs. sync those. ya? [14:40:09] (03Abandoned) 10Slyngshede: Remove duplicate installed apps from base settings. [software/bitu] - 10https://gerrit.wikimedia.org/r/895221 (owner: 10Slyngshede) [14:40:11] !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [14:40:30] !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [14:40:38] you need three [14:41:32] !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [14:41:35] the "then one with removal of extraneous files and configs." needs to be split into two, one to remove MWConfigCacheGenerator and sync from the file and then one to delete the file (you do sync dir) [14:41:40] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [14:41:40] 1. new and duplicate files and configs. [14:41:40] 2. MWconfigCache removal of loading ext-EventLogging [14:41:40] 3. removal of files [14:41:40] !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [14:41:41] ? [14:41:49] got it [14:41:55] oh sync dir okay [14:42:05] !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [14:42:10] !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[14:42:12] for the first one, you still need two or three syncs btw [14:42:17] the way I did it [14:42:21] (03PS1) 10Ottomata: Revert "wgEventStreams - Declare rc1.enrichment.mediawiki_page_content_change.error stream" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895770 [14:42:26] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [14:42:29] !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [14:42:37] k let me get patches in order and will double check syncs with you [14:42:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT deployments) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:43:14] !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [14:43:18] (03CR) 10Ottomata: [C: 03+2] Revert "wgEventStreams - Declare rc1.enrichment.mediawiki_page_content_change.error stream" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895770 (owner: 10Ottomata) [14:44:02] (03Merged) 10jenkins-bot: Revert "wgEventStreams - Declare rc1.enrichment.mediawiki_page_content_change.error stream" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895770 (owner: 10Ottomata) [14:44:19] (03PS2) 10Ottomata: Revert "Move wgEventStreams settings into ext-EventStreamConfig.php" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895769 [14:44:30] !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [14:45:17] !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
[14:46:06] !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [14:46:22] !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [14:46:32] 10SRE, 10Cloud-Services, 10Traffic, 10cloud-services-team: Horizon/lvs alerts the wrong people (and also is generally too sensitive) - https://phabricator.wikimedia.org/T331197 (10Andrew) 05Open→03Resolved I believe the most urgent version of this task is resolved. [14:46:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T329203)', diff saved to https://phabricator.wikimedia.org/P45519 and previous config saved to /var/cache/conftool/dbconfig/20230308-144634-marostegui.json [14:46:36] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2155.codfw.wmnet with reason: Maintenance [14:46:39] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [14:46:49] (03CR) 10Ottomata: [C: 03+2] Revert "Move wgEventStreams settings into ext-EventStreamConfig.php" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895769 (owner: 10Ottomata) [14:46:50] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2155.codfw.wmnet with reason: Maintenance [14:46:51] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [14:46:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [14:47:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2155 (T329203)', diff saved to https://phabricator.wikimedia.org/P45520 and previous config saved to /var/cache/conftool/dbconfig/20230308-144659-marostegui.json [14:47:25] 
RECOVERY - Check systemd state on krb1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:47:36] (03Merged) 10jenkins-bot: Revert "Move wgEventStreams settings into ext-EventStreamConfig.php" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895769 (owner: 10Ottomata) [14:47:58] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (PUT deployments) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:48:30] i'm going to do a quick simple config change deployment before this refactor, so I can work on another thing too [14:49:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T328817)', diff saved to https://phabricator.wikimedia.org/P45521 and previous config saved to /var/cache/conftool/dbconfig/20230308-144924-marostegui.json [14:49:26] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2170.codfw.wmnet with reason: Maintenance [14:49:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2170.codfw.wmnet with reason: Maintenance [14:49:29] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [14:49:33] 10SRE, 10serviceops: mw2420-mw2451 service implementation tracking - https://phabricator.wikimedia.org/T326363 (10akosiaris) [14:49:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2170:3311 (T328817)', diff saved to https://phabricator.wikimedia.org/P45522 and previous config saved to /var/cache/conftool/dbconfig/20230308-144934-marostegui.json [14:49:36] (03PS1) 10Ottomata: wgEventStreams - Declare rc1.enrichment.mediawiki_page_content_change.error stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895803 (https://phabricator.wikimedia.org/T326536) [14:50:36] (03CR) 10Ottomata: [C: 
03+2] wgEventStreams - Declare rc1.enrichment.mediawiki_page_content_change.error stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895803 (https://phabricator.wikimedia.org/T326536) (owner: 10Ottomata) [14:50:47] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. [14:50:57] (03PS1) 10Slyngshede: sre.ganeti.reimage do not clear DHCP cache. [cookbooks] - 10https://gerrit.wikimedia.org/r/895804 (https://phabricator.wikimedia.org/T331478) [14:50:58] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. [14:51:01] (03CR) 10Eevans: [C: 03+1] service, k8s: add service configuration for AQS2 service device-analytics [puppet] - 10https://gerrit.wikimedia.org/r/889960 (https://phabricator.wikimedia.org/T320967) (owner: 10Hnowlan) [14:51:19] (03Merged) 10jenkins-bot: wgEventStreams - Declare rc1.enrichment.mediawiki_page_content_change.error stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895803 (https://phabricator.wikimedia.org/T326536) (owner: 10Ottomata) [14:51:35] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:52:22] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. [14:52:26] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. [14:52:41] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. [14:52:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T329260)', diff saved to https://phabricator.wikimedia.org/P45523 and previous config saved to /var/cache/conftool/dbconfig/20230308-145245-marostegui.json [14:52:50] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [14:52:52] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. 
[14:52:58] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (PUT deployments) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:53:01] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/895804 (https://phabricator.wikimedia.org/T331478) (owner: 10Slyngshede) [14:55:20] (03CR) 10Slyngshede: [C: 03+2] sre.ganeti.reimage do not clear DHCP cache. [cookbooks] - 10https://gerrit.wikimedia.org/r/895804 (https://phabricator.wikimedia.org/T331478) (owner: 10Slyngshede) [14:55:29] (03PS1) 10Ottomata: wgEventStreams - fix typo in new error stream name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895805 (https://phabricator.wikimedia.org/T326536) [14:55:40] (03CR) 10Ottomata: [V: 03+2 C: 03+2] wgEventStreams - fix typo in new error stream name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895805 (https://phabricator.wikimedia.org/T326536) (owner: 10Ottomata) [14:57:22] (03Merged) 10jenkins-bot: sre.ganeti.reimage do not clear DHCP cache. [cookbooks] - 10https://gerrit.wikimedia.org/r/895804 (https://phabricator.wikimedia.org/T331478) (owner: 10Slyngshede) [15:00:30] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review: Ganeti reimage cookbook exception when running _clear_dhcp_cache - https://phabricator.wikimedia.org/T331478 (10SLyngshede-WMF) @BCornwall / @ssingh we've removed the clear dhcp cache part of the cookbook. It's technically not need at this point,... [15:00:37] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review: Ganeti reimage cookbook exception when running _clear_dhcp_cache - https://phabricator.wikimedia.org/T331478 (10SLyngshede-WMF) 05Open→03Resolved [15:01:38] wow sync file takes a long time now, (mw on k8s?) 
[15:01:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T328817)', diff saved to https://phabricator.wikimedia.org/P45524 and previous config saved to /var/cache/conftool/dbconfig/20230308-150150-marostegui.json [15:01:56] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [15:03:08] ottomata: imagine that during the full outage, that's why sync order is more important than before. The time it takes now is because of being in the migration stage of wikikube + bare metal [15:03:31] aye got it [15:03:50] i have to do another sync file now too because i made a typo in my config change ( non outage causing) [15:04:13] <_joe_> no, sync file should require about 2-3 minutes more than before [15:04:19] <_joe_> if it takes more something was wrong [15:04:57] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [15:05:00] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [15:05:07] !log otto@deploy2002 Synchronized wmf-config/ext-EventLogging.php: wgEventStreams - Declare rc1.enrichment.mediawiki_page_content_change.error stream - T326536 (duration: 11m 33s) [15:05:12] T326536: Streaming services errors should be routed to an error event topic. - https://phabricator.wikimedia.org/T326536 [15:05:13] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. [15:05:23] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. [15:05:55] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. [15:05:58] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. [15:06:07] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [15:06:10] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[15:06:22] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. [15:06:32] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. [15:07:13] 10SRE-swift-storage, 10MediaWiki-File-management, 10Unstewarded-production-error: `Filebackend::Multiwrite`, multi-dc and thumbnail handling - https://phabricator.wikimedia.org/T331138 (10Joe) >>! In T331138#8675719, @MatthewVernon wrote: > One further thought - it would be nice if we could take swift's spec... [15:08:59] PROBLEM - MegaRAID on an-worker1078 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [15:09:42] (03PS2) 10David Caro: maintain-dbusers: add nicer logging with dry run prefix [puppet] - 10https://gerrit.wikimedia.org/r/895756 [15:11:03] _joe_: it took 12 minutes [15:11:04] https://gitlab.wikimedia.org/-/snippets/64 [15:11:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT deployments) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:11:59] <_joe_> ottomata: aah damn I think yours was the first deployment after we've rebuilt the k8s nodes in eqiad [15:12:09] (03PS15) 10Thiemo Kreuz (WMDE): Use more compact PHP7 syntax where possible [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737859 [15:12:13] <_joe_> but next deployment should be faster [15:12:18] 10SRE-swift-storage, 10Data-Engineering-Planning, 10Event-Platform Value Stream: Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10JArguello-WMF) 
p:05Triage→03High [15:12:26] <_joe_> usually for small patches we're adding about 2-3 minutes per deployment [15:12:35] (03CR) 10JHathaway: [C: 03+2] kernel-purge: enable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/894729 (https://phabricator.wikimedia.org/T277011) (owner: 10JHathaway) [15:12:39] <_joe_> where "small" means "anything that doesn't regenerate l10n" [15:12:49] (03CR) 10Jaime Nuche: "Thank you to both!" [puppet] - 10https://gerrit.wikimedia.org/r/894542 (owner: 10Jaime Nuche) [15:13:13] _joe_: okay! i'm about to do a small one! [15:13:14] lets see [15:14:10] <_joe_> also please use phab's paste [15:14:19] <_joe_> its UI is much friendlier :) [15:14:24] 10SRE-swift-storage, 10Data-Engineering-Planning, 10Event-Platform Value Stream: Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10Ottomata) [15:14:37] 10SRE-swift-storage, 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 10): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10Ottomata) [15:14:54] 10SRE-swift-storage, 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 10): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10JArguello-WMF) [15:14:58] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-MoritzMuehlenhoff: Automated removal of obsolete kernels - https://phabricator.wikimedia.org/T277011 (10jhathaway) 05Open→03Resolved This is now enabled in production. 
[15:15:33] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:16:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P45525 and previous config saved to /var/cache/conftool/dbconfig/20230308-151656-marostegui.json [15:16:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT deployments) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:18:00] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: March 2023 Datacenter Switchover eqiad pooling schedule - https://phabricator.wikimedia.org/T328903 (10Clement_Goubert) [15:18:41] (03PS3) 10Esanders: Enable history page visual diffs everywhere except Wikipedias and Wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888804 (https://phabricator.wikimedia.org/T314588) [15:20:41] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Clement_Goubert) [15:20:58] (03PS1) 10Jbond: ldap: move ldap lookup_options to common.yaml [puppet] - 10https://gerrit.wikimedia.org/r/895811 [15:21:19] (03CR) 10CI reject: [V: 04-1] ldap: move ldap lookup_options to common.yaml [puppet] - 10https://gerrit.wikimedia.org/r/895811 (owner: 10Jbond) [15:21:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT deployments) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:22:22] (03PS3) 10David Caro: maintain-dbusers: add nicer logging with dry 
run prefix [puppet] - 10https://gerrit.wikimedia.org/r/895756 [15:22:24] (03PS1) 10David Caro: maintain-dbusers: fix systemd service description [puppet] - 10https://gerrit.wikimedia.org/r/895814 [15:22:44] !log otto@deploy2002 Synchronized wmf-config/ext-EventLogging.php: wgEventStreams - Fix typo in rc1.enrichment.mediawiki_page_content_change.error stream - T326536 (duration: 06m 41s) [15:22:51] T326536: Streaming services errors should be routed to an error event topic. - https://phabricator.wikimedia.org/T326536 [15:23:02] 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10cmooney) [15:23:42] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1060.eqiad.wmnet'] [15:25:16] (03CR) 10Jbond: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/895811 (owner: 10Jbond) [15:25:24] 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10cmooney) [15:25:44] (03CR) 10Raymond Ndibe: [C: 03+1] "You can merge when you want. 
Can't merge puppet yet" [puppet] - 10https://gerrit.wikimedia.org/r/895756 (owner: 10David Caro) [15:25:59] 10SRE, 10Infrastructure-Foundations, 10netops: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (10cmooney) [15:26:19] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1061.eqiad.wmnet'] [15:26:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT deployments) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:29:12] 10SRE, 10Data-Persistence, 10serviceops: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 (10Clement_Goubert) [15:29:37] (03PS2) 10Jbond: ldap: move ldap lookup_options to common.yaml [puppet] - 10https://gerrit.wikimedia.org/r/895811 [15:29:49] 10SRE, 10Data-Persistence, 10serviceops: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 (10Clement_Goubert) [15:30:32] 10SRE, 10Data-Persistence, 10serviceops: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 (10Clement_Goubert) p:05Triage→03Medium [15:30:45] 10SRE-tools, 10Infrastructure-Foundations: Ganeti reimage cookbook exception when running _clear_dhcp_cache - https://phabricator.wikimedia.org/T331478 (10ssingh) >>! In T331478#8676676, @SLyngshede-WMF wrote: > @BCornwall / @ssingh we've removed the clear dhcp cache part of the cookbook. It's technically not... 
[15:31:07] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2113.codfw.wmnet with reason: Maintenance [15:31:21] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2113.codfw.wmnet with reason: Maintenance [15:31:48] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['elastic1060.eqiad.wmnet'] [15:32:03] 10SRE-swift-storage, 10MediaWiki-File-management, 10Unstewarded-production-error: `Filebackend::Multiwrite`, multi-dc and thumbnail handling - https://phabricator.wikimedia.org/T331138 (10thcipriani) >>! In T331138#8675246, @Joe wrote: > @thcipriani what does the #unstewarded-production-error tag mean, in pr... [15:32:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P45526 and previous config saved to /var/cache/conftool/dbconfig/20230308-153202-marostegui.json [15:33:21] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['elastic1061.eqiad.wmnet'] [15:34:19] (03PS1) 10Muehlenhoff: Sync more clamd.conf settings from 0.103.8 [puppet] - 10https://gerrit.wikimedia.org/r/895815 (https://phabricator.wikimedia.org/T330129) [15:41:01] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40027/console" [puppet] - 10https://gerrit.wikimedia.org/r/889960 (https://phabricator.wikimedia.org/T320967) (owner: 10Hnowlan) [15:41:38] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "LGTM overall; a couple comments inline." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/893075 (https://phabricator.wikimedia.org/T320554) (owner: 10JHathaway) [15:42:20] _joe_: indeed was faster this time, but still was 7 minutes :) [15:42:24] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40028/console" [puppet] - 10https://gerrit.wikimedia.org/r/889960 (https://phabricator.wikimedia.org/T320967) (owner: 10Hnowlan) [15:42:28] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1062.eqiad.wmnet'] [15:42:37] <_joe_> ottomata: of which about 3 should come from k8s deployments [15:43:00] <_joe_> ottomata: now, the issue IMHO is that we use code deployments in lieu of a backoffice for a LOT of stuff [15:43:35] backoffice... [15:44:02] what's that mean? [15:44:12] (03CR) 10Ottomata: [C: 03+2] mediawiki-page-content-change-enrichment - set error-destination [deployment-charts] - 10https://gerrit.wikimedia.org/r/895800 (https://phabricator.wikimedia.org/T326536) (owner: 10Ottomata) [15:44:55] I'd say anything but flat config files that are deployed like code [15:45:02] (03CR) 10Muehlenhoff: [C: 03+2] thumbor: temporarily disable Scap deployment [puppet] - 10https://gerrit.wikimedia.org/r/894542 (owner: 10Jaime Nuche) [15:45:15] So k/v systems, DB, whatever [15:45:40] but _joe_ may have a different definition [15:45:58] (03PS1) 10David Caro: replica_cnf_api: skip tool account that don't have a home [puppet] - 10https://gerrit.wikimedia.org/r/895818 [15:46:06] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1063.eqiad.wmnet'] [15:46:17] (03CR) 10Hnowlan: [V: 03+1 C: 03+2] service, k8s: add service configuration for AQS2 service device-analytics [puppet] - 10https://gerrit.wikimedia.org/r/889960 (https://phabricator.wikimedia.org/T320967) (owner: 10Hnowlan) [15:46:29] (03PS2) 10David Caro: replica_cnf_api: skip tool account 
that don't have a home [puppet] - 10https://gerrit.wikimedia.org/r/895818 (https://phabricator.wikimedia.org/T303663) [15:46:39] (03PS4) 10David Caro: maintain-dbusers: add nicer logging with dry run prefix [puppet] - 10https://gerrit.wikimedia.org/r/895756 (https://phabricator.wikimedia.org/T303663) [15:46:51] (03PS2) 10David Caro: maintain-dbusers: fix systemd service description [puppet] - 10https://gerrit.wikimedia.org/r/895814 [15:46:58] (03PS3) 10David Caro: maintain-dbusers: fix systemd service description [puppet] - 10https://gerrit.wikimedia.org/r/895814 (https://phabricator.wikimedia.org/T303663) [15:47:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T328817)', diff saved to https://phabricator.wikimedia.org/P45527 and previous config saved to /var/cache/conftool/dbconfig/20230308-154709-marostegui.json [15:47:13] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2173.codfw.wmnet with reason: Maintenance [15:47:16] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [15:47:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T329203)', diff saved to https://phabricator.wikimedia.org/P45528 and previous config saved to /var/cache/conftool/dbconfig/20230308-154724-marostegui.json [15:47:27] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2173.codfw.wmnet with reason: Maintenance [15:47:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [15:47:30] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [15:47:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [15:47:31] 
<_joe_> ottomata: a place where people can add throttling rules, upload logos, add new wikis, change wikiversions automatically... [15:47:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2173 (T328817)', diff saved to https://phabricator.wikimedia.org/P45529 and previous config saved to /var/cache/conftool/dbconfig/20230308-154736-marostegui.json [15:47:44] <_joe_> you know, the admin panel any site has [15:47:51] _joe_: declare streams in event stream config? [15:47:57] Ah no, see he was really talking about a backoffice :'D [15:48:04] <_joe_> yes [15:48:06] <_joe_> :) [15:48:07] interesting. [15:48:23] i kinda thought we didn't like UIs for that cuz then we lose things like git history and easy reverts? [15:48:31] <_joe_> I mean [15:48:43] <_joe_> it's possible to write audit logs for configuration databases [15:48:53] <_joe_> and make it revertable [15:48:55] (03CR) 10CI reject: [V: 04-1] replica_cnf_api: skip tool account that don't have a home [puppet] - 10https://gerrit.wikimedia.org/r/895818 (https://phabricator.wikimedia.org/T303663) (owner: 10David Caro) [15:48:57] (03Merged) 10jenkins-bot: mediawiki-page-content-change-enrichment - set error-destination [deployment-charts] - 10https://gerrit.wikimedia.org/r/895800 (https://phabricator.wikimedia.org/T326536) (owner: 10Ottomata) [15:49:11] hm, sure. do people do that? 
[15:49:59] !log otto@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [15:50:08] (03PS3) 10David Caro: replica_cnf_api: skip tool account that don't have a home [puppet] - 10https://gerrit.wikimedia.org/r/895818 (https://phabricator.wikimedia.org/T303663) [15:50:13] !log otto@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [15:50:30] (03PS1) 10Jbond: P:blackbox_exporter: update client auth checks to use local certs [puppet] - 10https://gerrit.wikimedia.org/r/895821 [15:52:00] (03PS2) 10Jbond: P:blackbox_exporter: update client auth checks to use local certs [puppet] - 10https://gerrit.wikimedia.org/r/895821 [15:52:02] (03PS5) 10Jbond: pki: Add blackbox tests for pki services [puppet] - 10https://gerrit.wikimedia.org/r/895757 [15:52:10] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['elastic1062.eqiad.wmnet'] [15:52:14] (03PS6) 10Jbond: pki: Add blackbox tests for pki services [puppet] - 10https://gerrit.wikimedia.org/r/895757 [15:52:20] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1062.eqiad.wmnet'] [15:53:03] (03PS3) 10Jbond: P:blackbox_exporter: update client auth checks to use local certs [puppet] - 10https://gerrit.wikimedia.org/r/895821 [15:53:58] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40029/console" [puppet] - 10https://gerrit.wikimedia.org/r/895757 (owner: 10Jbond) [15:54:24] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host elastic1061.eqiad.wmnet [15:54:30] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40030/console" [puppet] - 10https://gerrit.wikimedia.org/r/895821 (owner: 10Jbond) [15:54:43] !log bking@cumin2002 END 
(FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['elastic1063.eqiad.wmnet'] [15:55:09] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1063.eqiad.wmnet'] [15:55:26] 10SRE-tools, 10Infrastructure-Foundations: Ganeti reimage cookbook exception when running _clear_dhcp_cache - https://phabricator.wikimedia.org/T331478 (10cmooney) >>! In T331478#8676676, @SLyngshede-WMF wrote: > @BCornwall / @ssingh we've removed the clear dhcp cache part of the cookbook. It's technically not... [15:56:34] (03CR) 10Jbond: P:blackbox_exporter: update client auth checks to use local certs [puppet] - 10https://gerrit.wikimedia.org/r/895821 (owner: 10Jbond) [15:57:03] 10SRE, 10ops-eqiad, 10Data-Engineering: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T330971 (10Cmjohnson) 05Open→03Resolved The disk has been swapped and back online. I am resolving this task and creating a new one for the BBU. [15:57:54] 10SRE-tools, 10Infrastructure-Foundations: Ganeti reimage cookbook exception when running _clear_dhcp_cache - https://phabricator.wikimedia.org/T331478 (10cmooney) [15:58:05] RECOVERY - MegaRAID on an-worker1132 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [15:58:19] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2098.codfw.wmnet with reason: Maintenance [15:58:33] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2098.codfw.wmnet with reason: Maintenance [15:58:36] 10ops-eqiad: anworker1132 BBU issue/replacement - https://phabricator.wikimedia.org/T331543 (10Cmjohnson) [15:59:04] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10MatthewVernon) Hi @Jclark-ctr any news on getting these frontends ready for use, 
please? [15:59:29] 10SRE, 10ops-eqiad, 10Data-Engineering: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T330971 (10nfraison) Strangely since the change of disk everything is back to normal ` RECOVERY - MegaRAID on an-worker1132 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech... [15:59:44] (03PS1) 10Herron: grafana: serve grafana/grafana-rw from eqiad [puppet] - 10https://gerrit.wikimedia.org/r/895772 [15:59:59] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['elastic1062.eqiad.wmnet'] [16:00:15] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host elastic1061.eqiad.wmnet [16:00:28] (03PS1) 10Hnowlan: kubernetes: add stub values for device-analytics [labs/private] - 10https://gerrit.wikimedia.org/r/895824 (https://phabricator.wikimedia.org/T320967) [16:00:49] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host elastic1062.eqiad.wmnet [16:01:11] (03PS1) 10Herron: grafana: add -next suffix to codfw grafana domains names [puppet] - 10https://gerrit.wikimedia.org/r/895773 [16:01:20] (03CR) 10Herron: "note: to be merged after I44ca0e3257febcb45c59baa9f57022340fb6266d has propagated" [puppet] - 10https://gerrit.wikimedia.org/r/895773 (owner: 10Herron) [16:01:21] 10ops-eqiad: anworker1132 BBU issue/replacement - https://phabricator.wikimedia.org/T331543 (10RhinosF1) 05Open→03Stalled Per other task, doesn’t seem actually needed. 
[16:02:16] 10ops-eqiad: anworker1132 BBU issue/replacement - https://phabricator.wikimedia.org/T331543 (10RhinosF1) This task was split from {T330971} [16:02:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T328817)', diff saved to https://phabricator.wikimedia.org/P45530 and previous config saved to /var/cache/conftool/dbconfig/20230308-160221-marostegui.json [16:02:26] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [16:02:27] !log bking@cumin2002 START - Cookbook sre.hosts.provision for host elastic1061.mgmt.eqiad.wmnet with reboot policy GRACEFUL [16:02:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P45531 and previous config saved to /var/cache/conftool/dbconfig/20230308-160231-marostegui.json [16:03:04] !log bking@cumin2002 START - Cookbook sre.hosts.provision for host elastic1060.mgmt.eqiad.wmnet with reboot policy GRACEFUL [16:03:51] (03PS6) 10Hnowlan: helmfile: add device-analytics configuration, namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/886358 (https://phabricator.wikimedia.org/T320967) [16:05:00] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Muehlenhoff out of all services on: 4 hosts [16:05:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Muehlenhoff out of all services on: 4 hosts [16:06:01] PROBLEM - Host elastic1061 is DOWN: PING CRITICAL - Packet loss = 100% [16:06:38] (03CR) 10Hnowlan: [V: 03+2 C: 03+2] kubernetes: add stub values for device-analytics [labs/private] - 10https://gerrit.wikimedia.org/r/895824 (https://phabricator.wikimedia.org/T320967) (owner: 10Hnowlan) [16:06:40] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on elastic1061.eqiad.wmnet with reason: re-rack [16:06:57] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
1:00:00 on elastic1061.eqiad.wmnet with reason: re-rack [16:07:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): Q1:rerack elastic10[53-67] - https://phabricator.wikimedia.org/T322082 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=5c6bb325-a116-457c-9a58-2cbd8dfcfd42) set by bking@cumin2002 for 1:00:00 on 1 host(s) and their service... [16:08:13] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host elastic1062.eqiad.wmnet [16:08:23] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on elastic1062.eqiad.wmnet with reason: re-rack [16:08:40] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on elastic1062.eqiad.wmnet with reason: re-rack [16:08:45] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): Q1:rerack elastic10[53-67] - https://phabricator.wikimedia.org/T322082 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=3fca5960-ee3f-4527-8324-4e3bbd02c3f7) set by bking@cumin2002 for 1:00:00 on 1 host(s) and their service... 
[16:09:48] (ProbeDown) firing: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:10:16] !log bking@cumin2002 START - Cookbook sre.hosts.provision for host elastic1062.mgmt.eqiad.wmnet with reboot policy GRACEFUL [16:11:07] RECOVERY - Host elastic1061 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [16:13:11] 10ops-eqiad, 10DC-Ops: hw troubleshooting: RAID BBU for an-worker1078 - https://phabricator.wikimedia.org/T331544 (10BTullis) [16:13:49] ACKNOWLEDGEMENT - MegaRAID on an-worker1078 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough Btullis Filed T331544 to replace BBU https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:14:09] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:14:23] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1062.mgmt.eqiad.wmnet with reboot policy GRACEFUL [16:14:31] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:14:34] (03PS1) 10Volans: Use GenericAlias for type hints [cookbooks] - 10https://gerrit.wikimedia.org/r/895827 [16:14:36] (03PS1) 10Volans: tox: fix setup for pytest on Python 3.10 [cookbooks] - 10https://gerrit.wikimedia.org/r/895828 [16:14:39] (03CR) 10Ladsgroup: [C: 03+1] Drop unused FlaggedRevs threshold level names (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790707 
(https://phabricator.wikimedia.org/T277883) (owner: 10Awight) [16:14:48] (ProbeDown) resolved: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:16:06] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2100.codfw.wmnet with reason: Maintenance [16:16:15] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.532 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:16:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2100.codfw.wmnet with reason: Maintenance [16:17:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P45532 and previous config saved to /var/cache/conftool/dbconfig/20230308-161727-marostegui.json [16:17:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P45533 and previous config saved to /var/cache/conftool/dbconfig/20230308-161737-marostegui.json [16:17:43] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49708 bytes in 0.067 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:18:55] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jan-Mar-2023), 10Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10Trizek-WMF) [16:19:00] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1061.mgmt.eqiad.wmnet with reboot policy GRACEFUL [16:22:25] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) 
for host elastic1060.mgmt.eqiad.wmnet with reboot policy GRACEFUL [16:22:58] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: service=thumbor,name=kubernetes201[0123].codfw.wmnet [16:23:12] (03PS1) 10Nicolas Fraison: hadoop: add back an-worker1132 [puppet] - 10https://gerrit.wikimedia.org/r/895830 (https://phabricator.wikimedia.org/T330979) [16:23:58] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "update location of elastic1061 - bking@cumin2002 - T322082" [16:24:03] T322082: Q1:rerack elastic10[53-67] - https://phabricator.wikimedia.org/T322082 [16:25:04] (03CR) 10Btullis: [C: 03+1] hadoop: add back an-worker1132 [puppet] - 10https://gerrit.wikimedia.org/r/895830 (https://phabricator.wikimedia.org/T330979) (owner: 10Nicolas Fraison) [16:25:09] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['elastic1063.eqiad.wmnet'] [16:25:14] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "update location of elastic1061 - bking@cumin2002 - T322082" [16:25:15] (03PS5) 10Eevans: data-persistence: alert on elevated sessions store error rate (5xx) [alerts] - 10https://gerrit.wikimedia.org/r/893538 (https://phabricator.wikimedia.org/T327960) [16:26:03] (03CR) 10Volans: [C: 03+2] alertmanager: match also FQDN [software/spicerack] - 10https://gerrit.wikimedia.org/r/895364 (owner: 10Volans) [16:28:02] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "update locatoin of elastic1060 - bking@cumin2002 - T322082" [16:28:05] (03CR) 10Cathal Mooney: [C: 03+2] Return port blocks data for both QFX5120-48Y Netbox device types [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/895725 (https://phabricator.wikimedia.org/T331519) (owner: 10Cathal Mooney) [16:28:32] !log hnowlan@puppetmaster1001 conftool action : 
set/pooled=inactive; selector: service=thumbor,name=kubernetes201[0123].codfw.wmnet [16:29:08] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "update locatoin of elastic1060 - bking@cumin2002 - T322082" [16:29:13] T322082: Q1:rerack elastic10[53-67] - https://phabricator.wikimedia.org/T322082 [16:29:28] (03Merged) 10jenkins-bot: alertmanager: match also FQDN [software/spicerack] - 10https://gerrit.wikimedia.org/r/895364 (owner: 10Volans) [16:29:35] (03CR) 10Nicolas Fraison: [C: 03+2] hadoop: add back an-worker1132 [puppet] - 10https://gerrit.wikimedia.org/r/895830 (https://phabricator.wikimedia.org/T330979) (owner: 10Nicolas Fraison) [16:32:11] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2152.codfw.wmnet with reason: Maintenance [16:32:24] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2152.codfw.wmnet with reason: Maintenance [16:32:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2152 (T329260)', diff saved to https://phabricator.wikimedia.org/P45534 and previous config saved to /var/cache/conftool/dbconfig/20230308-163230-marostegui.json [16:32:31] (Traffic bill over quota) firing: (2) Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota got worse - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [16:32:35] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [16:32:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P45535 and previous config saved to /var/cache/conftool/dbconfig/20230308-163240-marostegui.json [16:32:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T329203)', diff saved to https://phabricator.wikimedia.org/P45536 and previous config saved to 
/var/cache/conftool/dbconfig/20230308-163249-marostegui.json [16:32:51] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2172.codfw.wmnet with reason: Maintenance [16:32:55] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [16:33:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2172.codfw.wmnet with reason: Maintenance [16:33:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2172 (T329203)', diff saved to https://phabricator.wikimedia.org/P45537 and previous config saved to /var/cache/conftool/dbconfig/20230308-163311-marostegui.json [16:34:01] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "update location of elastic1062 - bking@cumin2002 - T322082" [16:34:46] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "update location of elastic1062 - bking@cumin2002 - T322082" [16:34:53] T322082: Q1:rerack elastic10[53-67] - https://phabricator.wikimedia.org/T322082 [16:35:03] !log bking@cumin2002 START - Cookbook sre.hosts.provision for host elastic1063.mgmt.eqiad.wmnet with reboot policy GRACEFUL [16:35:07] hm, actually Amir1 I can do this without removing ext-EventLogging.php. let's keep it an just put EventLogging things there! 
:) [16:35:54] sounds good to me [16:36:33] (03CR) 10Cwhite: [C: 03+1] grafana: serve grafana/grafana-rw from eqiad [puppet] - 10https://gerrit.wikimedia.org/r/895772 (owner: 10Herron) [16:36:41] (03CR) 10Cwhite: [C: 03+1] grafana: add -next suffix to codfw grafana domains names [puppet] - 10https://gerrit.wikimedia.org/r/895773 (owner: 10Herron) [16:37:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST events) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:41:01] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [16:41:04] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [16:41:12] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. [16:41:20] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. [16:42:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST events) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:44:17] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:44:32] (03PS1) 10Ottomata: ext-EventStreamConfig.php - wgEventStreams lives here [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895831 (https://phabricator.wikimedia.org/T308932) [16:44:34] (03PS1) 10Ottomata: wgEventStreams etc. 
- Remove duplicate configs after refactor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895832 (https://phabricator.wikimedia.org/T308932) [16:45:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T329203)', diff saved to https://phabricator.wikimedia.org/P45538 and previous config saved to /var/cache/conftool/dbconfig/20230308-164545-marostegui.json [16:45:52] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [16:45:59] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.239 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:47:30] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [16:47:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T328817)', diff saved to https://phabricator.wikimedia.org/P45539 and previous config saved to /var/cache/conftool/dbconfig/20230308-164746-marostegui.json [16:47:48] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2174.codfw.wmnet with reason: Maintenance [16:47:51] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [16:48:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2174.codfw.wmnet with reason: Maintenance [16:48:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2174 (T328817)', diff saved to https://phabricator.wikimedia.org/P45540 and previous config saved to /var/cache/conftool/dbconfig/20230308-164807-marostegui.json [16:49:56] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[16:51:04] (03CR) 10Ottomata: "Deploy steps:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895831 (https://phabricator.wikimedia.org/T308932) (owner: 10Ottomata) [16:51:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T329260)', diff saved to https://phabricator.wikimedia.org/P45541 and previous config saved to /var/cache/conftool/dbconfig/20230308-165121-marostegui.json [16:51:28] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [16:52:02] (03CR) 10Ottomata: "Deploy steps:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895832 (https://phabricator.wikimedia.org/T308932) (owner: 10Ottomata) [16:52:05] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1063.mgmt.eqiad.wmnet with reboot policy GRACEFUL [16:52:31] (Traffic bill over quota) resolved: (2) Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota got worse - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [16:52:44] okay Amir1 , how's [16:52:44] - https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/895831 [16:52:44] and then [16:52:44] - https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/895832 [16:52:44] look? [16:52:50] deploy steps in comments [16:53:29] let me see [16:55:19] (03CR) 10Alexandros Kosiaris: istio wikikube: Add the proper tolerations (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/895748 (owner: 10Alexandros Kosiaris) [16:55:21] the second one seems to be breaking because the config is changing now https://integration.wikimedia.org/ci/job/operations-mw-config-php74-composer-diffConfig-docker/1947/console : SUCCESS Please carefully review the change in effective configuration. 
in 45s (non-voting) [16:55:58] (03CR) 10Ladsgroup: "It should be noop but it's not according to jenkins:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895832 (https://phabricator.wikimedia.org/T308932) (owner: 10Ottomata) [16:56:58] Amir1: i think that's because patch 1 is not merged [16:57:00] ? [16:58:18] it should be able to handle dependencies [16:58:25] at least AFAIK [16:59:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T328817)', diff saved to https://phabricator.wikimedia.org/P45542 and previous config saved to /var/cache/conftool/dbconfig/20230308-165955-marostegui.json [17:00:01] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [17:00:40] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, 10User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (10akosiaris) >>! In T216815#8672370, @jnuche wrote: > @akosiaris thanks for the feedback. > > Just to clarify, we can work around the issue currently, but it makes t... 
[17:00:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P45543 and previous config saved to /var/cache/conftool/dbconfig/20230308-170051-marostegui.json [17:01:23] (03PS1) 10Sbailey: Enable new Linter UI for namespace, tag and template for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895833 (https://phabricator.wikimedia.org/T299612) [17:01:57] (03PS11) 10Jbond: redfish: add update commands using the patch method [software/spicerack] - 10https://gerrit.wikimedia.org/r/857783 [17:02:30] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson @MatthewVernon I have asked chris to help with installs reassigning to him for assistance [17:02:57] maybe i need to write Depends-On? [17:03:30] (03PS2) 10Ottomata: wgEventStreams etc. - Remove duplicate configs after refactor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895832 (https://phabricator.wikimedia.org/T308932) [17:04:08] a double check would be amazing, maybe something is done wrong, a typo here or there [17:04:15] yeah looking [17:05:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1109.eqiad.wmnet with reason: Maintenance [17:05:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1109.eqiad.wmnet with reason: Maintenance [17:05:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1109 (T318605)', diff saved to https://phabricator.wikimedia.org/P45545 and previous config saved to /var/cache/conftool/dbconfig/20230308-170512-ladsgroup.json [17:05:18] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [17:05:19] gonna make a new patch with a combined change not intended for merge, to see what the diff is there 
[17:05:43] (03CR) 10CI reject: [V: 04-1] redfish: add update commands using the patch method [software/spicerack] - 10https://gerrit.wikimedia.org/r/857783 (owner: 10Jbond) [17:06:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P45546 and previous config saved to /var/cache/conftool/dbconfig/20230308-170627-marostegui.json [17:06:56] AHH i see it. [17:08:22] (03PS2) 10Ottomata: ext-EventStreamConfig.php - wgEventStreams lives here [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895831 (https://phabricator.wikimedia.org/T308932) [17:08:28] (03PS3) 10Ottomata: wgEventStreams etc. - Remove duplicate configs after refactor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895832 (https://phabricator.wikimedia.org/T308932) [17:08:54] nice [17:09:00] good catch, thank you [17:09:36] (03CR) 10Andrew Bogott: [C: 03+1] replica_cnf_api: skip tool account that don't have a home [puppet] - 10https://gerrit.wikimedia.org/r/895818 (https://phabricator.wikimedia.org/T303663) (owner: 10David Caro) [17:09:47] (03CR) 10Ladsgroup: "now looks good https://integration.wikimedia.org/ci/job/operations-mw-config-php74-composer-diffConfig-docker/1950/console : FAILURE No ch" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895831 (https://phabricator.wikimedia.org/T308932) (owner: 10Ottomata) [17:09:53] good to go imo [17:09:57] with the steps [17:10:32] k ty [17:10:52] I'm calling it a day for now, it's public holiday here anyway. [17:11:00] ok thanks Amir1 much appreciated. 
[17:11:13] yw ^_^ [17:11:15] i'm in UK atm, I'll wait til tomorrow to deploy [17:11:26] (03CR) 10Ottomata: "Will deploy this tomorrow" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895831 (https://phabricator.wikimedia.org/T308932) (owner: 10Ottomata) [17:11:29] (03CR) 10Ottomata: "Will deploy this tomorrow" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895832 (https://phabricator.wikimedia.org/T308932) (owner: 10Ottomata) [17:12:39] (03CR) 10Sbailey: "Final release for group 2 wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895833 (https://phabricator.wikimedia.org/T299612) (owner: 10Sbailey) [17:15:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P45547 and previous config saved to /var/cache/conftool/dbconfig/20230308-171501-marostegui.json [17:15:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P45548 and previous config saved to /var/cache/conftool/dbconfig/20230308-171558-marostegui.json [17:21:34] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1064.eqiad.wmnet'] [17:21:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P45549 and previous config saved to /var/cache/conftool/dbconfig/20230308-172134-marostegui.json [17:21:41] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1065.eqiad.wmnet'] [17:26:17] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1066.eqiad.wmnet'] [17:26:41] (03PS5) 10David Caro: maintain-dbusers: add nicer logging with dry run prefix [puppet] - 10https://gerrit.wikimedia.org/r/895756 (https://phabricator.wikimedia.org/T303663) [17:28:40] !log bking@cumin2002 END (FAIL) - Cookbook
sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['elastic1064.eqiad.wmnet'] [17:29:30] (03PS1) 10Btullis: Bump the version of the spark-operator that we deploy [deployment-charts] - 10https://gerrit.wikimedia.org/r/895837 (https://phabricator.wikimedia.org/T318926) [17:30:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P45550 and previous config saved to /var/cache/conftool/dbconfig/20230308-173007-marostegui.json [17:31:00] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['elastic1065.eqiad.wmnet'] [17:31:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T329203)', diff saved to https://phabricator.wikimedia.org/P45551 and previous config saved to /var/cache/conftool/dbconfig/20230308-173104-marostegui.json [17:31:06] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2179.codfw.wmnet with reason: Maintenance [17:31:09] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [17:31:16] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1065.eqiad.wmnet'] [17:31:19] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2179.codfw.wmnet with reason: Maintenance [17:31:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2179 (T329203)', diff saved to https://phabricator.wikimedia.org/P45552 and previous config saved to /var/cache/conftool/dbconfig/20230308-173125-marostegui.json [17:31:51] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1064.eqiad.wmnet'] [17:32:38] (03PS4) 10David Caro: replica_cnf_api: skip tool account that don't have a home [puppet] - 
10https://gerrit.wikimedia.org/r/895818 (https://phabricator.wikimedia.org/T303663) [17:32:49] (03CR) 10David Caro: "Updated the message to be more meaningful" [puppet] - 10https://gerrit.wikimedia.org/r/895818 (https://phabricator.wikimedia.org/T303663) (owner: 10David Caro) [17:34:30] (03PS1) 10David Caro: maintain-dbusers: skip tool accounts that are not ready [puppet] - 10https://gerrit.wikimedia.org/r/895838 (https://phabricator.wikimedia.org/T303663) [17:34:52] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['elastic1066.eqiad.wmnet'] [17:34:56] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic1066.eqiad.wmnet'] [17:36:07] (03CR) 10David Caro: [C: 03+2] replica_cnf_api: skip tool account that don't have a home [puppet] - 10https://gerrit.wikimedia.org/r/895818 (https://phabricator.wikimedia.org/T303663) (owner: 10David Caro) [17:36:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T329260)', diff saved to https://phabricator.wikimedia.org/P45553 and previous config saved to /var/cache/conftool/dbconfig/20230308-173640-marostegui.json [17:36:42] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2154.codfw.wmnet with reason: Maintenance [17:36:45] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [17:36:55] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2154.codfw.wmnet with reason: Maintenance [17:37:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2154 (T329260)', diff saved to https://phabricator.wikimedia.org/P45554 and previous config saved to /var/cache/conftool/dbconfig/20230308-173701-marostegui.json [17:37:54] (03CR) 10Btullis: [C: 03+2] Bump the version of the spark-operator that we deploy [deployment-charts] - 
10https://gerrit.wikimedia.org/r/895837 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [17:38:29] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['elastic1065.eqiad.wmnet'] [17:42:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T329203)', diff saved to https://phabricator.wikimedia.org/P45555 and previous config saved to /var/cache/conftool/dbconfig/20230308-174208-marostegui.json [17:42:13] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [17:42:39] (03Merged) 10jenkins-bot: Bump the version of the spark-operator that we deploy [deployment-charts] - 10https://gerrit.wikimedia.org/r/895837 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [17:43:26] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['elastic1066.eqiad.wmnet'] [17:45:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1109 (T318605)', diff saved to https://phabricator.wikimedia.org/P45556 and previous config saved to /var/cache/conftool/dbconfig/20230308-174501-ladsgroup.json [17:45:07] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [17:45:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T328817)', diff saved to https://phabricator.wikimedia.org/P45557 and previous config saved to /var/cache/conftool/dbconfig/20230308-174514-marostegui.json [17:45:15] (MjolnirUpdateFailureRateExceedesThreshold) firing: Data shipping to CirrusSearch in eqiad is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold [17:45:16] !log marostegui@cumin1001 
START - Cookbook sre.hosts.downtime for 12:00:00 on db2176.codfw.wmnet with reason: Maintenance [17:45:18] !log bking@cumin2002 START - Cookbook sre.hosts.provision for host elastic1065.mgmt.eqiad.wmnet with reboot policy GRACEFUL [17:45:19] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [17:45:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2176.codfw.wmnet with reason: Maintenance [17:45:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2176 (T328817)', diff saved to https://phabricator.wikimedia.org/P45558 and previous config saved to /var/cache/conftool/dbconfig/20230308-174535-marostegui.json [17:46:39] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['elastic1064.eqiad.wmnet'] [17:47:39] !log brett@cumin2002 START - Cookbook sre.ganeti.reimage for host acmechief-test2001.codfw.wmnet with OS bullseye [17:47:48] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by brett@cumin2002 for host acmechief-test2001.codfw.wmnet with OS bullseye [17:47:55] !log bking@cumin2002 START - Cookbook sre.hosts.provision for host elastic1064.mgmt.eqiad.wmnet with reboot policy GRACEFUL [17:48:16] !log bking@cumin2002 START - Cookbook sre.hosts.provision for host elastic1066.mgmt.eqiad.wmnet with reboot policy GRACEFUL [17:48:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH events) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:50:15] (MjolnirUpdateFailureRateExceedesThreshold) firing: (2) Data shipping to CirrusSearch in codfw is experiencing abnormal failure 
rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold [17:50:44] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [17:50:49] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [17:51:28] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [17:51:33] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [17:52:26] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1066.mgmt.eqiad.wmnet with reboot policy GRACEFUL [17:53:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH events) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:55:01] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [17:55:15] (MjolnirUpdateFailureRateExceedesThreshold) resolved: (2) Data shipping to CirrusSearch in codfw is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold [17:56:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T329260)', diff saved to https://phabricator.wikimedia.org/P45559 and previous config saved to /var/cache/conftool/dbconfig/20230308-175625-marostegui.json [17:56:30] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [17:56:31] (03CR) 10Herron: [C: 03+2] grafana: serve grafana/grafana-rw from eqiad [puppet] - 10https://gerrit.wikimedia.org/r/895772 (owner: 10Herron) [17:56:33] !log brett@cumin2002 
START - Cookbook sre.hosts.downtime for 2:00:00 on acmechief-test2001.codfw.wmnet with reason: host reimage [17:57:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P45560 and previous config saved to /var/cache/conftool/dbconfig/20230308-175714-marostegui.json [17:58:04] !log failing grafana over from codfw to eqiad [17:58:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T328817)', diff saved to https://phabricator.wikimedia.org/P45561 and previous config saved to /var/cache/conftool/dbconfig/20230308-175810-marostegui.json [17:58:15] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [17:58:51] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/thumbor: apply [17:58:55] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [17:59:08] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on acmechief-test2001.codfw.wmnet with reason: host reimage [17:59:10] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "update location of elastic1066 - bking@cumin2002 - T322082" [17:59:16] T322082: Q1:rerack elastic10[53-67] - https://phabricator.wikimedia.org/T322082 [17:59:30] (03CR) 10Subramanya Sastry: [C: 03+1] Enable new Linter UI for namespace, tag and template for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895833 (https://phabricator.wikimedia.org/T299612) (owner: 10Sbailey) [17:59:52] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/thumbor: sync [17:59:56] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/thumbor: sync [18:00:04] Deploy window MediaWiki infrastucture (UTC late) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230308T1800) [18:00:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1109', diff saved to https://phabricator.wikimedia.org/P45562 and previous config saved to /var/cache/conftool/dbconfig/20230308-180008-ladsgroup.json [18:00:15] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1435 is CRITICAL: etcd last index (1690293) is outdated compared to the master one (1690296) https://wikitech.wikimedia.org/wiki/Etcd [18:00:18] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/thumbor: sync [18:00:23] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/thumbor: sync [18:02:00] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/thumbor: apply [18:02:04] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [18:02:07] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1435 is OK: etcd last index (1690296) matches the master one (1690296) https://wikitech.wikimedia.org/wiki/Etcd [18:02:22] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1065.mgmt.eqiad.wmnet with reboot policy GRACEFUL [18:02:36] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/thumbor: apply [18:02:47] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [18:04:28] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1064.mgmt.eqiad.wmnet with reboot policy GRACEFUL [18:05:38] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "update location of elastic1066 - bking@cumin2002 - T322082" [18:05:44] T322082: Q1:rerack elastic10[53-67] - https://phabricator.wikimedia.org/T322082 [18:05:57] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "update locatoin of elastic1064 - bking@cumin2002 - T322082" 
[18:09:27] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/thumbor: apply [18:09:29] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [18:09:47] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/thumbor: apply [18:11:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P45563 and previous config saved to /var/cache/conftool/dbconfig/20230308-181131-marostegui.json [18:12:02] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "update locatoin of elastic1064 - bking@cumin2002 - T322082" [18:12:06] T322082: Q1:rerack elastic10[53-67] - https://phabricator.wikimedia.org/T322082 [18:12:20] (03PS1) 10Btullis: Specify docker image and version consistently [deployment-charts] - 10https://gerrit.wikimedia.org/r/895842 (https://phabricator.wikimedia.org/T318926) [18:12:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P45564 and previous config saved to /var/cache/conftool/dbconfig/20230308-181220-marostegui.json [18:13:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P45565 and previous config saved to /var/cache/conftool/dbconfig/20230308-181316-marostegui.json [18:13:35] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "update locatoin of elastic1065 - bking@cumin2002 - T322082" [18:13:36] !log bking@cumin2002 END (ERROR) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=97) generate netbox hiera data: "update locatoin of elastic1065 - bking@cumin2002 - T322082" [18:14:42] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/thumbor: apply [18:15:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db1109', diff saved to https://phabricator.wikimedia.org/P45566 and previous config saved to /var/cache/conftool/dbconfig/20230308-181514-ladsgroup.json [18:16:09] !log brett@cumin2002 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host acmechief-test2001.codfw.wmnet with OS bullseye [18:16:15] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by brett@cumin2002 for host acmechief-test2001.codfw.wmnet with OS bullseye completed: - acmechief-test2001 (**WARN**) - Downtimed on Icinga/Ale... [18:16:53] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/thumbor: apply [18:18:06] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "update locatoin of elastic1064-65 - bking@cumin2002 - T322082" [18:18:11] T322082: Q1:rerack elastic10[53-67] - https://phabricator.wikimedia.org/T322082 [18:19:09] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "update locatoin of elastic1064-65 - bking@cumin2002 - T322082" [18:20:15] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/thumbor: apply [18:26:24] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/thumbor: apply [18:26:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P45567 and previous config saved to /var/cache/conftool/dbconfig/20230308-182637-marostegui.json [18:27:22] !log bking@cumin2002 unban elastic1060-1066 to finish off T322082 [18:27:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:27] T322082: Q1:rerack elastic10[53-67] - https://phabricator.wikimedia.org/T322082 [18:27:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T329203)', diff saved to 
https://phabricator.wikimedia.org/P45568 and previous config saved to /var/cache/conftool/dbconfig/20230308-182726-marostegui.json [18:27:32] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [18:28:17] !log bking@cumin2002 repool elastic1060-1066 to finish off T322082 [18:28:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P45569 and previous config saved to /var/cache/conftool/dbconfig/20230308-182822-marostegui.json [18:28:35] RECOVERY - Check systemd state on grafana2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:28:45] (03CR) 10Herron: [C: 03+2] grafana: add -next suffix to codfw grafana domains names [puppet] - 10https://gerrit.wikimedia.org/r/895773 (owner: 10Herron) [18:30:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1109 (T318605)', diff saved to https://phabricator.wikimedia.org/P45570 and previous config saved to /var/cache/conftool/dbconfig/20230308-183020-ladsgroup.json [18:30:26] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [18:33:39] (HelmReleaseBadStatus) firing: Helm release thumbor/main on k8s-staging@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-staging&var-namespace=thumbor - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [18:36:29] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [18:38:39] (HelmReleaseBadStatus) resolved: Helm release thumbor/main on k8s-staging@codfw in state pending-upgrade - 
https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-staging&var-namespace=thumbor - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [18:41:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T329260)', diff saved to https://phabricator.wikimedia.org/P45571 and previous config saved to /var/cache/conftool/dbconfig/20230308-184143-marostegui.json [18:41:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2161.codfw.wmnet with reason: Maintenance [18:41:49] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [18:41:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2161.codfw.wmnet with reason: Maintenance [18:42:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2161 (T329260)', diff saved to https://phabricator.wikimedia.org/P45572 and previous config saved to /var/cache/conftool/dbconfig/20230308-184204-marostegui.json [18:43:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T328817)', diff saved to https://phabricator.wikimedia.org/P45573 and previous config saved to /var/cache/conftool/dbconfig/20230308-184328-marostegui.json [18:43:33] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [18:46:15] (03CR) 10Slyngshede: "LGTM, documentation for those who wonder how this works, like myself: https://www.puppet.com/docs/puppet/7/hiera_merging.html#merge_behavi" [puppet] - 10https://gerrit.wikimedia.org/r/895811 (owner: 10Jbond) [18:47:16] (03PS4) 10Slyngshede: C:idm::deployment fix LDAP configuration [puppet] - 10https://gerrit.wikimedia.org/r/895747 [18:48:20] (03CR) 10Slyngshede: C:idm::deployment fix LDAP configuration (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/895747 (owner: 10Slyngshede) [18:50:38] (03PS1) 10Slyngshede: C:idm::deployment ldap servers must be a list. [puppet] - 10https://gerrit.wikimedia.org/r/895844 [18:51:00] (03CR) 10CI reject: [V: 04-1] C:idm::deployment ldap servers must be a list. [puppet] - 10https://gerrit.wikimedia.org/r/895844 (owner: 10Slyngshede) [18:51:33] (03Abandoned) 10Slyngshede: C:idm::deployment fix LDAP configuration [puppet] - 10https://gerrit.wikimedia.org/r/895747 (owner: 10Slyngshede) [18:52:41] (03PS2) 10Slyngshede: C:idm::deployment ldap servers must be a list. [puppet] - 10https://gerrit.wikimedia.org/r/895844 [18:53:32] (03PS1) 10BCornwall: acmechief: Ensure rsync is installed [puppet] - 10https://gerrit.wikimedia.org/r/895845 (https://phabricator.wikimedia.org/T321309) [18:55:45] (03PS2) 10BCornwall: acmechief: Ensure rsync is installed [puppet] - 10https://gerrit.wikimedia.org/r/895845 (https://phabricator.wikimedia.org/T321309) [18:57:04] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40032/console" [puppet] - 10https://gerrit.wikimedia.org/r/895845 (https://phabricator.wikimedia.org/T321309) (owner: 10BCornwall) [18:57:34] (03PS3) 10Slyngshede: Read systems and approval rules from YAML file. [software/bitu] - 10https://gerrit.wikimedia.org/r/895182 [18:59:45] (03CR) 10Ssingh: [C: 03+1] "LGTM! Verified PCC." [puppet] - 10https://gerrit.wikimedia.org/r/895845 (https://phabricator.wikimedia.org/T321309) (owner: 10BCornwall) [18:59:56] (03CR) 10BCornwall: [V: 03+1 C: 03+2] acmechief: Ensure rsync is installed [puppet] - 10https://gerrit.wikimedia.org/r/895845 (https://phabricator.wikimedia.org/T321309) (owner: 10BCornwall) [19:00:05] jeena and jnuche: #bothumor My software never has bugs. It just develops random features. Rise for Train log triage with CPT. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230308T1900). 
[19:00:05] jeena and jnuche: Dear deployers, time to do the MediaWiki train - Utc-7+Utc-0 Version deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230308T1900). [19:00:40] 10SRE, 10SRE-Access-Requests, 10SecTeam-Processed, 10Security: New production ssh key for sbassett - https://phabricator.wikimedia.org/T331554 (10sbassett) [19:00:50] (03PS2) 10Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - 10https://gerrit.wikimedia.org/r/895356 [19:01:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T329260)', diff saved to https://phabricator.wikimedia.org/P45574 and previous config saved to /var/cache/conftool/dbconfig/20230308-190106-marostegui.json [19:01:15] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [19:01:32] (03PS1) 10TrainBranchBot: group1 wikis to 1.40.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895846 (https://phabricator.wikimedia.org/T330204) [19:01:34] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.40.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895846 (https://phabricator.wikimedia.org/T330204) (owner: 10TrainBranchBot) [19:02:17] (03Merged) 10jenkins-bot: group1 wikis to 1.40.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895846 (https://phabricator.wikimedia.org/T330204) (owner: 10TrainBranchBot) [19:02:50] (03CR) 10CI reject: [V: 04-1] puppetserver: (WIP) add basic class for puppert server [puppet] - 10https://gerrit.wikimedia.org/r/895356 (owner: 10Jbond) [19:04:08] (03CR) 10Ahmon Dancy: gitlab_runner: add optional docker registry proxy to runners (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/894100 (https://phabricator.wikimedia.org/T329679) (owner: 10Jelto) [19:08:06] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [19:09:46] !log sukhe@cumin2002 START - Cookbook 
sre.hosts.remove-downtime for acmechief-test2001.codfw.wmnet [19:09:46] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief-test2001.codfw.wmnet [19:09:55] !log jhuneidi@deploy2002 rebuilt and synchronized wikiversions files: group1 wikis to 1.40.0-wmf.26 refs T330204 [19:10:00] T330204: 1.40.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T330204 [19:11:20] (03PS1) 10BCornwall: acmechief-test: Set acmechief-test2001 as active [puppet] - 10https://gerrit.wikimedia.org/r/895847 (https://phabricator.wikimedia.org/T321309) [19:13:27] (03PS1) 10Cathal Mooney: Add reverse DNS origin entries for newly allocated IPv6 ranges [dns] - 10https://gerrit.wikimedia.org/r/895848 (https://phabricator.wikimedia.org/T327919) [19:13:48] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add reverse entries for new links from CRs to cloudsw1-b1-codfw. - cmooney@cumin1001" [19:14:18] (03CR) 10CI reject: [V: 04-1] Add reverse DNS origin entries for newly allocated IPv6 ranges [dns] - 10https://gerrit.wikimedia.org/r/895848 (https://phabricator.wikimedia.org/T327919) (owner: 10Cathal Mooney) [19:14:53] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add reverse entries for new links from CRs to cloudsw1-b1-codfw. 
- cmooney@cumin1001" [19:14:53] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:16:12] !log jhuneidi@deploy2002 Synchronized php: group1 wikis to 1.40.0-wmf.26 refs T330204 (duration: 06m 16s) [19:16:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P45575 and previous config saved to /var/cache/conftool/dbconfig/20230308-191612-marostegui.json [19:16:16] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [19:16:19] T330204: 1.40.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T330204 [19:19:54] (03PS3) 10Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - 10https://gerrit.wikimedia.org/r/895356 [19:19:57] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40035/console" [puppet] - 10https://gerrit.wikimedia.org/r/895847 (https://phabricator.wikimedia.org/T321309) (owner: 10BCornwall) [19:21:38] (03CR) 10Ssingh: [C: 03+1] acmechief-test: Set acmechief-test2001 as active [puppet] - 10https://gerrit.wikimedia.org/r/895847 (https://phabricator.wikimedia.org/T321309) (owner: 10BCornwall) [19:21:53] (03CR) 10CI reject: [V: 04-1] puppetserver: (WIP) add basic class for puppert server [puppet] - 10https://gerrit.wikimedia.org/r/895356 (owner: 10Jbond) [19:21:59] (03CR) 10BCornwall: [V: 03+1 C: 03+2] acmechief-test: Set acmechief-test2001 as active [puppet] - 10https://gerrit.wikimedia.org/r/895847 (https://phabricator.wikimedia.org/T321309) (owner: 10BCornwall) [19:23:44] (03PS4) 10Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - 10https://gerrit.wikimedia.org/r/895356 [19:25:42] (03CR) 10CI reject: [V: 04-1] puppetserver: (WIP) add basic class for puppert server [puppet] - 10https://gerrit.wikimedia.org/r/895356 (owner: 10Jbond) [19:28:13] 
(03PS2) 10Cathal Mooney: Add reverse DNS origin entries for newly allocated ranges. [dns] - 10https://gerrit.wikimedia.org/r/895848 (https://phabricator.wikimedia.org/T327919) [19:29:05] (03CR) 10CI reject: [V: 04-1] Add reverse DNS origin entries for newly allocated ranges. [dns] - 10https://gerrit.wikimedia.org/r/895848 (https://phabricator.wikimedia.org/T327919) (owner: 10Cathal Mooney) [19:29:41] (03PS3) 10Cathal Mooney: Add reverse DNS origin entries for newly allocated ranges. [dns] - 10https://gerrit.wikimedia.org/r/895848 (https://phabricator.wikimedia.org/T327919) [19:30:03] (03PS5) 10Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - 10https://gerrit.wikimedia.org/r/895356 [19:31:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P45576 and previous config saved to /var/cache/conftool/dbconfig/20230308-193118-marostegui.json [19:31:20] !log brett@cumin2002 START - Cookbook sre.ganeti.reimage for host acmechief-test1001.eqiad.wmnet with OS bullseye [19:31:27] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by brett@cumin2002 for host acmechief-test1001.eqiad.wmnet with OS bullseye [19:32:06] (03CR) 10CI reject: [V: 04-1] puppetserver: (WIP) add basic class for puppert server [puppet] - 10https://gerrit.wikimedia.org/r/895356 (owner: 10Jbond) [19:36:41] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/news (get In the News content) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds [19:37:33] (03PS6) 10Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - 10https://gerrit.wikimedia.org/r/895356 [19:37:57] PROBLEM - cassandra-b CQL 10.192.32.192:9042 on restbase2022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://phabricator.wikimedia.org/T93886 [19:38:05] PROBLEM - cassandra-a CQL 10.192.32.191:9042 on restbase2022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [19:38:19] PROBLEM - cassandra-c CQL 10.192.32.193:9042 on restbase2022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [19:38:19] RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [19:38:51] PROBLEM - SSH on restbase2022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:39:03] PROBLEM - Restbase root url on restbase2022 is CRITICAL: connect to address 10.192.32.190 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase [19:39:06] (03CR) 10Volans: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/895848 (https://phabricator.wikimedia.org/T327919) (owner: 10Cathal Mooney) [19:39:31] (03CR) 10CI reject: [V: 04-1] puppetserver: (WIP) add basic class for puppert server [puppet] - 10https://gerrit.wikimedia.org/r/895356 (owner: 10Jbond) [19:40:48] (03CR) 10Cathal Mooney: [C: 03+2] Add reverse DNS origin entries for newly allocated ranges. 
[dns] - 10https://gerrit.wikimedia.org/r/895848 (https://phabricator.wikimedia.org/T327919) (owner: 10Cathal Mooney) [19:41:46] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on acmechief-test1001.eqiad.wmnet with reason: host reimage [19:44:21] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on acmechief-test1001.eqiad.wmnet with reason: host reimage [19:46:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T329260)', diff saved to https://phabricator.wikimedia.org/P45577 and previous config saved to /var/cache/conftool/dbconfig/20230308-194625-marostegui.json [19:46:27] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2162.codfw.wmnet with reason: Maintenance [19:46:30] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [19:46:40] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2162.codfw.wmnet with reason: Maintenance [19:46:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2162 (T329260)', diff saved to https://phabricator.wikimedia.org/P45578 and previous config saved to /var/cache/conftool/dbconfig/20230308-194646-marostegui.json [19:52:08] (03PS1) 10Gergő Tisza: Leveling up: check if the task type is registered before increasing its edit count [extensions/GrowthExperiments] (wmf/1.40.0-wmf.26) - 10https://gerrit.wikimedia.org/r/895776 (https://phabricator.wikimedia.org/T331524) [19:52:24] (03PS1) 10Gergő Tisza: Leveling up: check if the task type is registered before increasing its edit count [extensions/GrowthExperiments] (wmf/1.40.0-wmf.25) - 10https://gerrit.wikimedia.org/r/895777 (https://phabricator.wikimedia.org/T331524) [19:53:29] (03CR) 10Herron: [C: 03+1] "thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/895144 (owner: 10Muehlenhoff) [19:56:19] (03PS7) 10Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - 10https://gerrit.wikimedia.org/r/895356 [19:56:39] PROBLEM - Check systemd state on arclamp1001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_apache2-htcacheclean.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:56:56] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for David Martin - https://phabricator.wikimedia.org/T331500 (10dr0ptp4kt) Approved, and please grant access to Hive via Kerberos auth. [19:57:36] (03PS8) 10Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - 10https://gerrit.wikimedia.org/r/895356 [19:58:53] (03PS1) 10Gergő Tisza: maintenance: Adjust query builder to account for no secondary namespaces [extensions/PageTriage] (wmf/1.40.0-wmf.26) - 10https://gerrit.wikimedia.org/r/895778 (https://phabricator.wikimedia.org/T321983) [19:59:42] (03CR) 10CI reject: [V: 04-1] puppetserver: (WIP) add basic class for puppert server [puppet] - 10https://gerrit.wikimedia.org/r/895356 (owner: 10Jbond) [19:59:52] (03PS1) 10Gergő Tisza: maintenance: Adjust query builder to account for no secondary namespaces [extensions/PageTriage] (wmf/1.40.0-wmf.25) - 10https://gerrit.wikimedia.org/r/895779 (https://phabricator.wikimedia.org/T321983) [20:01:13] PROBLEM - Disk space on grafana2001 is CRITICAL: DISK CRITICAL - free space: / 141 MB (1% inode=45%): /tmp 141 MB (1% inode=45%): /var/tmp 141 MB (1% inode=45%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops [20:01:55] !log brett@cumin2002 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host acmechief-test1001.eqiad.wmnet with OS bullseye [20:02:01] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - 
https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by brett@cumin2002 for host acmechief-test1001.eqiad.wmnet with OS bullseye completed: - acmechief-test1001 (**WARN**) - Downtimed on Icinga/Ale... [20:03:33] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on acmechief-test2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [20:04:09] (03CR) 10Herron: [C: 03+1] "LGTM, nice color choice" [puppet] - 10https://gerrit.wikimedia.org/r/895713 (owner: 10Filippo Giunchedi) [20:05:38] (03CR) 10CI reject: [V: 04-1] maintenance: Adjust query builder to account for no secondary namespaces [extensions/PageTriage] (wmf/1.40.0-wmf.26) - 10https://gerrit.wikimedia.org/r/895778 (https://phabricator.wikimedia.org/T321983) (owner: 10Gergő Tisza) [20:08:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T329260)', diff saved to https://phabricator.wikimedia.org/P45579 and previous config saved to /var/cache/conftool/dbconfig/20230308-200855-marostegui.json [20:09:01] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [20:18:33] (KeyholderUnarmed) resolved: 1 unarmed Keyholder key(s) on acmechief-test2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [20:18:45] (03CR) 10Gergő Tisza: "recheck" [extensions/PageTriage] (wmf/1.40.0-wmf.26) - 10https://gerrit.wikimedia.org/r/895778 (https://phabricator.wikimedia.org/T321983) (owner: 10Gergő Tisza) [20:18:53] !log power cycle restbase2022 (unresponsive; cannot SSH) [20:18:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:57] PROBLEM - Host restbase2022 is DOWN: PING CRITICAL - Packet loss = 100% [20:21:39] RECOVERY - Host restbase2022 is UP: PING OK - Packet loss = 0%, RTA = 33.12 ms [20:21:47] PROBLEM - 
cassandra-b SSL 10.192.32.192:7001 on restbase2022 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [20:21:47] PROBLEM - cassandra-a service on restbase2022 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:21:51] RECOVERY - SSH on restbase2022 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:21:51] RECOVERY - Restbase root url on restbase2022 is OK: HTTP OK: HTTP/1.1 200 - 17572 bytes in 0.135 second response time https://wikitech.wikimedia.org/wiki/RESTBase [20:22:15] PROBLEM - cassandra-c SSL 10.192.32.193:7001 on restbase2022 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [20:22:19] PROBLEM - cassandra-a SSL 10.192.32.191:7001 on restbase2022 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [20:22:30] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [20:22:47] PROBLEM - cassandra-c service on restbase2022 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:23:01] PROBLEM - cassandra-b service on restbase2022 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:24:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P45580 and previous config saved to /var/cache/conftool/dbconfig/20230308-202401-marostegui.json [20:24:26] 
!log brett@cumin2002 START - Cookbook sre.ganeti.reimage for host acmechief2001.codfw.wmnet with OS bullseye [20:24:32] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by brett@cumin2002 for host acmechief2001.codfw.wmnet with OS bullseye [20:24:37] RECOVERY - cassandra-c service on restbase2022 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:24:51] RECOVERY - cassandra-b service on restbase2022 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:25:25] RECOVERY - cassandra-a service on restbase2022 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:25:53] RECOVERY - cassandra-c SSL 10.192.32.193:7001 on restbase2022 is OK: SSL OK - Certificate restbase2022-c valid until 2023-11-25 11:38:59 +0000 (expires in 261 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [20:25:57] RECOVERY - cassandra-a SSL 10.192.32.191:7001 on restbase2022 is OK: SSL OK - Certificate restbase2022-a valid until 2023-11-25 11:38:54 +0000 (expires in 261 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [20:25:59] RECOVERY - cassandra-c CQL 10.192.32.193:9042 on restbase2022 is OK: TCP OK - 0.033 second response time on 10.192.32.193 port 9042 https://phabricator.wikimedia.org/T93886 [20:26:27] (03CR) 10Jbond: "/var/lib/puppet/ssl/crl.pem get overridden when puppet server starts" [puppet] - 10https://gerrit.wikimedia.org/r/895356 (owner: 10Jbond) [20:27:05] RECOVERY - cassandra-b CQL 10.192.32.192:9042 on restbase2022 is OK: TCP OK - 0.033 second response time on 10.192.32.192 port 9042 https://phabricator.wikimedia.org/T93886 [20:27:13] RECOVERY - cassandra-b SSL 10.192.32.192:7001 on restbase2022 is OK: SSL OK - Certificate 
restbase2022-b valid until 2023-11-25 11:38:57 +0000 (expires in 261 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [20:27:21] RECOVERY - cassandra-a CQL 10.192.32.191:9042 on restbase2022 is OK: TCP OK - 0.033 second response time on 10.192.32.191 port 9042 https://phabricator.wikimedia.org/T93886 [20:30:52] (03CR) 10AOkoth: vrts: copy data to passive host (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/895334 (https://phabricator.wikimedia.org/T323515) (owner: 10AOkoth) [20:31:06] (03CR) 10AOkoth: [C: 03+2] vrts: copy data to passive host [puppet] - 10https://gerrit.wikimedia.org/r/895334 (https://phabricator.wikimedia.org/T323515) (owner: 10AOkoth) [20:36:23] (03CR) 10Jbond: puppetserver: (WIP) add basic class for puppert server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/895356 (owner: 10Jbond) [20:36:34] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on acmechief2001.codfw.wmnet with reason: host reimage [20:36:52] PROBLEM - Check systemd state on acmechief1001 is CRITICAL: CRITICAL - degraded: The following units failed: acme-chief-certs-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:39:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P45581 and previous config saved to /var/cache/conftool/dbconfig/20230308-203907-marostegui.json [20:39:42] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on acmechief2001.codfw.wmnet with reason: host reimage [20:41:42] !log deploy2002 - systemctl restart keyholder-proxy.service to fix T331568 - after this SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh -i /etc/keyholder.d/deploy_jenkins -l deploy-jenkins releases1002.eqiad.wmnet works [20:41:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:47] T331568: scap can not ssh with keyholder on 
deploy2002 - https://phabricator.wikimedia.org/T331568 [20:43:37] (03CR) 10Herron: "I can see the use in ensuring a consistent logrotate configs across the fleet, but to that end I'd prefer to stick to a consistent rotate " [puppet] - 10https://gerrit.wikimedia.org/r/894646 (owner: 10Jbond) [20:46:19] (03PS1) 10Jforrester: [BETA CLUSTER] Log WikiLambda events into logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895858 [20:48:28] jeena: OK for me to slip out a Beta-only config patch? [20:48:56] James_F: yup all clear [20:49:00] Awesome. [20:49:24] Oh, is deployment.codfw.wmnet the current host or deployment.eqiad.wmnet? I saw the rights concern but wanted to check. [20:49:24] RECOVERY - Check systemd state on acmechief1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:49:27] (03CR) 10Jforrester: [C: 03+2] [BETA CLUSTER] Log WikiLambda events into logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895858 (owner: 10Jforrester) [20:49:57] deploy2002 it looks like? Cool. [20:50:14] (03Merged) 10jenkins-bot: [BETA CLUSTER] Log WikiLambda events into logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895858 (owner: 10Jforrester) [20:50:37] All done. [20:51:01] !log brett@cumin2002 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host acmechief2001.codfw.wmnet with OS bullseye [20:51:07] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by brett@cumin2002 for host acmechief2001.codfw.wmnet with OS bullseye completed: - acmechief2001 (**PASS**) - Downtimed on Icinga/Alertmanager... 
[20:51:43] (03PS1) 10BCornwall: acmechief: Set acmechief2001 as active [puppet] - 10https://gerrit.wikimedia.org/r/895860 (https://phabricator.wikimedia.org/T321309) [20:54:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T329260)', diff saved to https://phabricator.wikimedia.org/P45582 and previous config saved to /var/cache/conftool/dbconfig/20230308-205414-marostegui.json [20:54:16] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2163.codfw.wmnet with reason: Maintenance [20:54:19] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [20:54:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2163.codfw.wmnet with reason: Maintenance [20:54:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2163 (T329260)', diff saved to https://phabricator.wikimedia.org/P45583 and previous config saved to /var/cache/conftool/dbconfig/20230308-205435-marostegui.json [20:58:22] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [21:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230308T2100). [21:00:04] kemayo, sbailey, and tgr: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:18] I am here :-) [21:00:18] I can deploy [21:00:19] 👋🏻 [21:00:29] 10SRE, 10ops-codfw, 10Wikimedia-Incident: 2022-12-15 codfw worker exhaustion - https://phabricator.wikimedia.org/T328353 (10lmata) 05Open→03Resolved a:03lmata @Papaul: thanks for following up, I'll be resolving it. 
AFAIK we reviewed this and a private doc exists with notes from the event. [21:00:31] o/ [21:00:46] Oh boy this is a big one. [21:00:54] (all yours! :D) [21:00:56] Yeah, it's shot up since this morning [21:01:12] I can deploy my patches if you prefer [21:01:45] tgr if you wouldn't mind doing that after I finish up the others, I would appreciate it. [21:01:51] (also, will start merging the GrowthExperiment ones as the CI there takes forever) [21:02:03] (03CR) 10Gergő Tisza: [C: 03+2] Leveling up: check if the task type is registered before increasing its edit count [extensions/GrowthExperiments] (wmf/1.40.0-wmf.25) - 10https://gerrit.wikimedia.org/r/895777 (https://phabricator.wikimedia.org/T331524) (owner: 10Gergő Tisza) [21:02:09] (03CR) 10Gergő Tisza: [C: 03+2] Leveling up: check if the task type is registered before increasing its edit count [extensions/GrowthExperiments] (wmf/1.40.0-wmf.26) - 10https://gerrit.wikimedia.org/r/895776 (https://phabricator.wikimedia.org/T331524) (owner: 10Gergő Tisza) [21:02:52] Kemayo: are you comfortable if I deploy your three together? [21:02:59] Sure, that's fine. [21:03:02] (looks like wikibugs-bot i18n got a downgrade) [21:06:39] 10SRE, 10Observability-Logging, 10Wikimedia-Logstash: Logstash SLO excursion on 2023-02-11 - https://phabricator.wikimedia.org/T331461 (10lmata) The team chatted about this during our weekly meeting today; we're still investigating; this might take a couple of weeks to queue as we've some short-term sprint-w... 
[21:07:28] !log hashar@deploy2002 Started deploy [releng/jenkins-deploy@0e465ac] (releasing): (no justification provided) [21:08:29] !log hashar@deploy2002 Finished deploy [releng/jenkins-deploy@0e465ac] (releasing): (no justification provided) (duration: 01m 01s) [21:10:04] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kindrobot@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888804 (https://phabricator.wikimedia.org/T314588) (owner: 10Esanders) [21:10:06] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kindrobot@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895375 (https://phabricator.wikimedia.org/T328942) (owner: 10DLynch) [21:10:08] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kindrobot@deploy2002 using scap backport" [skins/Vector] (wmf/1.40.0-wmf.26) - 10https://gerrit.wikimedia.org/r/895297 (https://phabricator.wikimedia.org/T267444) (owner: 10DLynch) [21:11:20] jouncebot: now [21:11:20] For the next 0 hour(s) and 48 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230308T2100) [21:11:25] (03Merged) 10jenkins-bot: Enable history page visual diffs everywhere except Wikipedias and Wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888804 (https://phabricator.wikimedia.org/T314588) (owner: 10Esanders) [21:11:58] (03PS2) 10Stef Dunlap: Release DiscussionTools on mobile on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895375 (https://phabricator.wikimedia.org/T328942) (owner: 10DLynch) [21:12:13] (03CR) 10TrainBranchBot: "Approved by kindrobot@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895375 (https://phabricator.wikimedia.org/T328942) (owner: 10DLynch) [21:12:15] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kindrobot@deploy2002 using scap backport" [skins/Vector] (wmf/1.40.0-wmf.26) - 10https://gerrit.wikimedia.org/r/895297 
(https://phabricator.wikimedia.org/T267444) (owner: 10DLynch) [21:13:02] (03Merged) 10jenkins-bot: Release DiscussionTools on mobile on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895375 (https://phabricator.wikimedia.org/T328942) (owner: 10DLynch) [21:19:13] Oops, forgot to... [21:19:14] !log start UTC-late backport window [21:19:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:41] (03Merged) 10jenkins-bot: Leveling up: check if the task type is registered before increasing its edit count [extensions/GrowthExperiments] (wmf/1.40.0-wmf.25) - 10https://gerrit.wikimedia.org/r/895777 (https://phabricator.wikimedia.org/T331524) (owner: 10Gergő Tisza) [21:21:43] (03Merged) 10jenkins-bot: Leveling up: check if the task type is registered before increasing its edit count [extensions/GrowthExperiments] (wmf/1.40.0-wmf.26) - 10https://gerrit.wikimedia.org/r/895776 (https://phabricator.wikimedia.org/T331524) (owner: 10Gergő Tisza) [21:21:48] kindrobot: fun I thought you were an actual robot account used by `scap backport` :D [21:22:30] Hehehe [21:22:32] :P [21:22:41] !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@3419b7d]: test deploy after deployment fix [21:22:46] !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@3419b7d]: test deploy after deployment fix (duration: 00m 05s) [21:22:51] I mean, in a way... [21:23:12] Truly, are we not all meat robots? [21:23:21] we should rename scap to something like "human driven robot assisted deployment" [21:23:32] (no clue whether the acronym would make any sense) [21:24:31] we should replace scap with container builds and never speak of the old process again [21:24:43] yeah that is the long term goal eventually [21:24:45] * bd808 earned the right to hate scap [21:25:25] then when it works, it works! ™ [21:26:38] But if we replace them with container builds... when will I get to hang out with y'all? 
[21:27:29] (03Merged) 10jenkins-bot: Switch order of "Add topic" and language dropdown [skins/Vector] (wmf/1.40.0-wmf.26) - 10https://gerrit.wikimedia.org/r/895297 (https://phabricator.wikimedia.org/T267444) (owner: 10DLynch) [21:29:06] !log kindrobot@deploy2002 Started scap: Backport for [[gerrit:888804|Enable history page visual diffs everywhere except Wikipedias and Wiktionaries (T314588)]], [[gerrit:895375|Release DiscussionTools on mobile on enwiki (T328942)]], [[gerrit:895297|Switch order of "Add topic" and language dropdown (T267444)]] [21:29:14] T328942: [Config Change] Enable all DiscussionTools as default-on features at Phase 2 wikis (mobile) - https://phabricator.wikimedia.org/T328942 [21:29:14] T267444: Make the affordance(s) for adding a new topic easier to identify and access (Vector 2022) - https://phabricator.wikimedia.org/T267444 [21:29:14] T314588: Launch visual diffs on history pages out of beta and provide it to all users - https://phabricator.wikimedia.org/T314588 [21:30:46] !log kindrobot@deploy2002 kemayo and kindrobot and esanders: Backport for [[gerrit:888804|Enable history page visual diffs everywhere except Wikipedias and Wiktionaries (T314588)]], [[gerrit:895375|Release DiscussionTools on mobile on enwiki (T328942)]], [[gerrit:895297|Switch order of "Add topic" and language dropdown (T267444)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [21:31:16] (03CR) 10JHathaway: "thanks again for reviewing joe" [deployment-charts] - 10https://gerrit.wikimedia.org/r/893075 (https://phabricator.wikimedia.org/T320554) (owner: 10JHathaway) [21:31:27] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host sessionstore1001.eqiad.wmnet [21:31:43] kindrobot: you would be able to hang out in #wikimedia-k8s-broke :) [21:31:57] (03PS2) 10JHathaway: Run kubeconform on supported versions of charts & envs [deployment-charts] - 
10https://gerrit.wikimedia.org/r/893075 (https://phabricator.wikimedia.org/T320554) [21:31:59] (03PS2) 10JHathaway: jaeger: add fixture [deployment-charts] - 10https://gerrit.wikimedia.org/r/893076 (https://phabricator.wikimedia.org/T320554) [21:32:01] (03PS8) 10JHathaway: Add jaeger to aux-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/888761 (https://phabricator.wikimedia.org/T320554) [21:32:02] more seriously, I believe we will still need some level of synchronization [21:32:12] (03CR) 10CI reject: [V: 04-1] Run kubeconform on supported versions of charts & envs [deployment-charts] - 10https://gerrit.wikimedia.org/r/893075 (https://phabricator.wikimedia.org/T320554) (owner: 10JHathaway) [21:32:14] (03CR) 10CI reject: [V: 04-1] jaeger: add fixture [deployment-charts] - 10https://gerrit.wikimedia.org/r/893076 (https://phabricator.wikimedia.org/T320554) (owner: 10JHathaway) [21:32:28] (03CR) 10CI reject: [V: 04-1] Add jaeger to aux-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/888761 (https://phabricator.wikimedia.org/T320554) (owner: 10JHathaway) [21:32:42] Kemayo: your changes are on the test servers, can you confirm them? [21:32:56] Sure, give me just a second [21:33:49] 10SRE-swift-storage, 10Commons, 10Wikimedia-production-error: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10Yann) `02592: FAILED: internal_api_error_UploadChunkFileException: [f70476d1-1ac6-44a8-8ffb-a134bea401d1... [21:35:55] kindrobot: Looks good [21:36:13] Great, continuing the sync. [21:37:57] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sessionstore1001.eqiad.wmnet [21:39:23] bd808: my connection cut off from the deployment box, but I was at the "Continue with sync? (y/n):" prompt. [21:39:33] Will I accomplish finishing it with "scap sync-world"? [21:41:17] kindrobot: I don't know. 
dancy might though [21:41:32] PROBLEM - Sessionstore eqiad on sessionstore.svc.eqiad.wmnet is CRITICAL: /sessions/v1/{key} (Get value for key) is CRITICAL: Test Get value for key returned the unexpected status 500 (expecting: 200): /sessions/v1/{key} (Store value for key) is CRITICAL: Test Store value for key returned the unexpected status 500 (expecting: 201) https://www.mediawiki.org/wiki/Kask [21:41:45] * dancy thinks [21:41:53] * kindrobot is looking through documentation [21:42:00] It will definitely require a 'y' input to proceed. [21:42:28] I think the process stopped when my connection dropped (I forgot to start tmux) [21:42:31] I do see that there is still a scap sync-world process running [21:42:38] Oh, huh [21:43:08] (03PS1) 10Bking: search-airflow: grant admin rights to analytics-search-users [puppet] - 10https://gerrit.wikimedia.org/r/895865 (https://phabricator.wikimedia.org/T327970) [21:43:55] My advice is to log back in, kill pid 13150 and start the operation over. [21:44:05] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/895865 (https://phabricator.wikimedia.org/T327970) (owner: 10Bking) [21:44:11] OK. I will do that. 
[21:44:42] RECOVERY - MegaRAID on an-worker1078 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [21:44:52] (CR) Ebernhardson: [C: +1] search-airflow: grant admin rights to analytics-search-users [puppet] - https://gerrit.wikimedia.org/r/895865 (https://phabricator.wikimedia.org/T327970) (owner: Bking) [21:45:21] Thanks for the ping bd808 [21:45:43] Thanks for your help :) [21:46:51] !log kindrobot@deploy2002 Started scap: Backport for [[gerrit:895297|Switch order of "Add topic" and language dropdown (T267444)]], [[gerrit:895375|Release DiscussionTools on mobile on enwiki (T328942)]], [[gerrit:888804|Enable history page visual diffs everywhere except Wikipedias and Wiktionaries (T314588)]] [21:46:53] (SessionStoreErrorRateHigh) firing: Session storage error rates (5xx) in eqiad are elevated #page - TODO - https://grafana.wikimedia.org/d/000001590/sessionstore?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreErrorRateHigh [21:46:58] T328942: [Config Change] Enable all DiscussionTools as default-on features at Phase 2 wikis (mobile) - https://phabricator.wikimedia.org/T328942 [21:46:59] T267444: Make the affordance(s) for adding a new topic easier to identify and access (Vector 2022) - https://phabricator.wikimedia.org/T267444 [21:46:59] T314588: Launch visual diffs on history pages out of beta and provide it to all users - https://phabricator.wikimedia.org/T314588 [21:47:01] hm [21:47:14] OK, I've kicked off the backport again (should go faster because it's already merged) [21:47:15] here [21:47:22] that's me! [21:47:24] ^^^ [21:47:47] urandom: need help?
[21:48:21] no no it's fine [21:48:32] but...how do I silence that now 🤔 [21:48:45] !log kindrobot@deploy2002 kemayo and kindrobot and esanders: Backport for [[gerrit:895297|Switch order of "Add topic" and language dropdown (T267444)]], [[gerrit:895375|Release DiscussionTools on mobile on enwiki (T328942)]], [[gerrit:888804|Enable history page visual diffs everywhere except Wikipedias and Wiktionaries (T314588)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [21:48:59] I should have asked that before, even if it is nice to know that it works [21:50:25] I set out to generate some errors, but failed to take into account that I'd also just set up an alert on errors [21:50:46] RECOVERY - Sessionstore eqiad on sessionstore.svc.eqiad.wmnet is OK: All endpoints are healthy https://www.mediawiki.org/wiki/Kask [21:50:50] urandom: one way is to respond to the SMS with that "magic number" to ACK it, another is to go to alerts.wikimedia.org and click in the web UI, yet another is to use the "downtime" cookbook for a limited time [21:50:54] so good news, the alert works, bad news... I need to test without rattling everyone's cage [21:50:54] urandom: on alerts.wikimedia.org, there's a struck-out bell icon upper right to "create a new silence" [21:51:21] it's not super intuitive, but basically you have to put in the matching fields to suppress that alert [21:51:36] !incidents [21:51:36] You're not allowed to perform this action. [21:51:44] oh >:( [21:51:53] (SessionStoreErrorRateHigh) resolved: Session storage error rates (5xx) in eqiad are elevated #page - TODO - https://grafana.wikimedia.org/d/000001590/sessionstore?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreErrorRateHigh [21:51:58] !incidents [21:51:58] 3466 (RESOLVED) SessionStoreErrorRateHigh (eqiad) [21:52:04] (iirc you can ack via that too?)
[21:52:31] yes, for acking/resolving we have a bunch of options, but I assume what urandom's looking for is a way to pre-silence for an extended period, to avoid paging [21:52:41] I am, yeah [21:52:54] and really for just one datacenter [21:53:00] (ah) [21:53:09] if you start with clicking the specific alerts.wm.o link from the page/alert text above [21:53:23] alerts will already be set to filter for that alert, and it will carry over to the "New silence" dialogue, which helps [21:53:57] https://wikitech.wikimedia.org/wiki/Alertmanager#Silences_&_acknowledgements has some details but is basically walking you through what bblack is suggesting [21:54:04] and you can use the faint "+" below "alertname" in that dialogue, to add more variables, like "site" to match just eqiad [21:54:40] !log kindrobot@deploy2002 Finished scap: Backport for [[gerrit:895297|Switch order of "Add topic" and language dropdown (T267444)]], [[gerrit:895375|Release DiscussionTools on mobile on enwiki (T328942)]], [[gerrit:888804|Enable history page visual diffs everywhere except Wikipedias and Wiktionaries (T314588)]] (duration: 07m 49s) [21:54:48] T328942: [Config Change] Enable all DiscussionTools as default-on features at Phase 2 wikis (mobile) - https://phabricator.wikimedia.org/T328942 [21:54:48] T267444: Make the affordance(s) for adding a new topic easier to identify and access (Vector 2022) - https://phabricator.wikimedia.org/T267444 [21:54:48] T314588: Launch visual diffs on history pages out of beta and provide it to all users - https://phabricator.wikimedia.org/T314588 [21:54:58] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:55:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T329260)', diff saved to 
https://phabricator.wikimedia.org/P45584 and previous config saved to /var/cache/conftool/dbconfig/20230308-215500-marostegui.json [21:55:06] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [21:55:17] sbailey: are you ready? [21:55:23] yes [21:55:48] (CR) TrainBranchBot: [C: +2] "Approved by kindrobot@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/895833 (https://phabricator.wikimedia.org/T299612) (owner: Sbailey) [21:56:00] (PS2) Stef Dunlap: Enable new Linter UI for namespace, tag and template for all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/895833 (https://phabricator.wikimedia.org/T299612) (owner: Sbailey) [21:56:15] (CR) TrainBranchBot: "Approved by kindrobot@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/895833 (https://phabricator.wikimedia.org/T299612) (owner: Sbailey) [21:57:03] (Merged) jenkins-bot: Enable new Linter UI for namespace, tag and template for all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/895833 (https://phabricator.wikimedia.org/T299612) (owner: Sbailey) [21:57:27] !log kindrobot@deploy2002 Started scap: Backport for [[gerrit:895833|Enable new Linter UI for namespace, tag and template for all wikis (T299612)]] [21:57:32] T299612: Add namespace column and index to table - https://phabricator.wikimedia.org/T299612 [21:57:45] SRE, ops-codfw, Wikimedia-Incident: 2022-12-15 codfw worker exhaustion - https://phabricator.wikimedia.org/T328353 (Papaul) Thank you.
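Editor's note: the silence described a few minutes earlier in this log (match on "alertname", then add "site" to restrict it to eqiad) can be sketched as the JSON body that the Alertmanager v2 API accepts on POST /api/v2/silences. This is a hedged illustration, not the procedure used on alerts.wikimedia.org: the helper name build_silence, the author string, the comment, and the two-hour duration are assumptions, and nothing is sent over the network.

```python
import json
from datetime import datetime, timedelta, timezone


def build_silence(alertname, site, hours, author, comment, now=None):
    """Build an Alertmanager v2 silence body matching one alert in one site."""
    now = now or datetime.now(timezone.utc)

    def matcher(name, value):
        # Exact (non-regex) equality match on a single label.
        return {"name": name, "value": value, "isRegex": False, "isEqual": True}

    return {
        "matchers": [matcher("alertname", alertname), matcher("site", site)],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=hours)).isoformat(),
        "createdBy": author,
        "comment": comment,
    }


# Hypothetical values: only the alert name and the site come from the log.
payload = build_silence(
    "SessionStoreErrorRateHigh",
    "eqiad",
    hours=2,
    author="example-user",
    comment="testing error alerting without paging",
    now=datetime(2023, 3, 8, 21, 55, tzinfo=timezone.utc),
)
print(json.dumps(payload, indent=2))
```

Adding the second matcher is what keeps the silence from swallowing the same alert in other datacenters, which is the point made in the discussion about the "+" below "alertname".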
[21:58:34] SRE, ops-codfw, Data-Persistence (work done), decommission-hardware: decommission db2093.codfw.wmnet - https://phabricator.wikimedia.org/T330827 (Papaul) a:Jhancock.wm [21:59:09] !log kindrobot@deploy2002 sbailey and kindrobot: Backport for [[gerrit:895833|Enable new Linter UI for namespace, tag and template for all wikis (T299612)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [21:59:33] sbailey: on test servers, can you confirm?? [21:59:44] checking [21:59:48] (PS1) Dzahn: admin: remove ssh key for sbassett [puppet] - https://gerrit.wikimedia.org/r/895869 (https://phabricator.wikimedia.org/T331554) [21:59:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:00:25] looks good [22:01:28] Great, syncing now. [22:01:29] kindrobot: verified on en.wikipedia and de.wikipedia debug servers [22:02:05] could you poke me when the backport window has completed? I will restart Jenkins CI :) [22:02:11] (no rush) [22:04:48] Sure, I'll be handing off to tgr in a moment. Do you copy hashar 's message tgr? [22:05:10] ack [22:07:03] !log kindrobot@deploy2002 Finished scap: Backport for [[gerrit:895833|Enable new Linter UI for namespace, tag and template for all wikis (T299612)]] (duration: 09m 36s) [22:07:08] T299612: Add namespace column and index to table - https://phabricator.wikimedia.org/T299612 [22:08:09] Ok, I've finished kemayo and sbailey's backports. Are you good to take it from here tgr ? [22:08:28] Thanks kindrobot :-) [22:08:28] yes, thanks kindrobot [22:08:42] Thank you everyone.
:) [22:09:02] !log hand off backport window UTC late to tgr for self-service [22:09:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:14] (CR) Dzahn: [C: +2] admin: remove ssh key for sbassett [puppet] - https://gerrit.wikimedia.org/r/895869 (https://phabricator.wikimedia.org/T331554) (owner: Dzahn) [22:09:49] (PS3) JHathaway: Run kubeconform on supported versions of charts & envs [deployment-charts] - https://gerrit.wikimedia.org/r/893075 (https://phabricator.wikimedia.org/T320554) [22:09:51] (PS3) JHathaway: jaeger: add fixture [deployment-charts] - https://gerrit.wikimedia.org/r/893076 (https://phabricator.wikimedia.org/T320554) [22:09:53] (PS9) JHathaway: Add jaeger to aux-k8s-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/888761 (https://phabricator.wikimedia.org/T320554) [22:10:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P45585 and previous config saved to /var/cache/conftool/dbconfig/20230308-221006-marostegui.json [22:12:10] (CR) TrainBranchBot: [C: +2] "Approved by tgr@deploy2002 using scap backport" [extensions/PageTriage] (wmf/1.40.0-wmf.26) - https://gerrit.wikimedia.org/r/895778 (https://phabricator.wikimedia.org/T321983) (owner: Gergő Tisza) [22:12:13] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host sessionstore1001.eqiad.wmnet [22:12:22] (CR) TrainBranchBot: [C: +2] "Approved by tgr@deploy2002 using scap backport" [extensions/PageTriage] (wmf/1.40.0-wmf.25) - https://gerrit.wikimedia.org/r/895779 (https://phabricator.wikimedia.org/T321983) (owner: Gergő Tisza) [22:12:30] SRE, SRE-Access-Requests, SecTeam-Processed, Security: New production ssh key for sbassett - https://phabricator.wikimedia.org/T331554 (Dzahn) Normally these are handled by the SRE on clinic duty but since it's late in Europe and to be on the safe side I just
revoked the existing key and ran pupp... [22:16:54] (CR) JHathaway: [C: +2] Run kubeconform on supported versions of charts & envs [deployment-charts] - https://gerrit.wikimedia.org/r/893075 (https://phabricator.wikimedia.org/T320554) (owner: JHathaway) [22:19:49] SRE, SRE-Access-Requests, SecTeam-Processed, Security: New production ssh key for sbassett - https://phabricator.wikimedia.org/T331554 (Dzahn) Open→In progress [22:20:03] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sessionstore1001.eqiad.wmnet [22:20:22] (CR) JHathaway: [V: +2 C: +2] Run kubeconform on supported versions of charts & envs [deployment-charts] - https://gerrit.wikimedia.org/r/893075 (https://phabricator.wikimedia.org/T320554) (owner: JHathaway) [22:20:55] (CR) JHathaway: [C: +2] jaeger: add fixture [deployment-charts] - https://gerrit.wikimedia.org/r/893076 (https://phabricator.wikimedia.org/T320554) (owner: JHathaway) [22:20:59] (Merged) jenkins-bot: maintenance: Adjust query builder to account for no secondary namespaces [extensions/PageTriage] (wmf/1.40.0-wmf.26) - https://gerrit.wikimedia.org/r/895778 (https://phabricator.wikimedia.org/T321983) (owner: Gergő Tisza) [22:21:06] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host sessionstore1001.eqiad.wmnet [22:21:07] (Merged) jenkins-bot: maintenance: Adjust query builder to account for no secondary namespaces [extensions/PageTriage] (wmf/1.40.0-wmf.25) - https://gerrit.wikimedia.org/r/895779 (https://phabricator.wikimedia.org/T321983) (owner: Gergő Tisza) [22:21:26] !log tgr@deploy2002 Started scap: Backport for [[gerrit:895778|maintenance: Adjust query builder to account for no secondary namespaces (T321983 T331412)]], [[gerrit:895779|maintenance: Adjust query builder to account for no secondary namespaces (T321983 T331412)]] [22:21:32] T321983: Drop support for userspace patrolling -
https://phabricator.wikimedia.org/T321983 [22:21:32] T331412: InvalidArgumentException from line 234 of /srv/mediawiki/php-1.40.0-wmf.25/includes/libs/rdbms/platform/SQLPlatform.php: Wikimedia\Rdbms\Platform\SQLPlatform::makeList: empty input for field page_namespace - https://phabricator.wikimedia.org/T331412 [22:23:31] !log tgr@deploy2002 tgr: Backport for [[gerrit:895778|maintenance: Adjust query builder to account for no secondary namespaces (T321983 T331412)]], [[gerrit:895779|maintenance: Adjust query builder to account for no secondary namespaces (T321983 T331412)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [22:23:31] (PS1) Volans: doc: dynamically set copyright year to current [software/spicerack] - https://gerrit.wikimedia.org/r/895872 [22:25:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P45586 and previous config saved to /var/cache/conftool/dbconfig/20230308-222512-marostegui.json [22:27:11] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sessionstore1001.eqiad.wmnet [22:28:34] hashar: still deploying but everything that needed to be merged on gerrit is merged [22:29:09] !log tgr@deploy2002 Finished scap: Backport for [[gerrit:895778|maintenance: Adjust query builder to account for no secondary namespaces (T321983 T331412)]], [[gerrit:895779|maintenance: Adjust query builder to account for no secondary namespaces (T321983 T331412)]] (duration: 07m 43s) [22:29:16] T321983: Drop support for userspace patrolling - https://phabricator.wikimedia.org/T321983 [22:29:16] T331412: InvalidArgumentException from line 234 of /srv/mediawiki/php-1.40.0-wmf.25/includes/libs/rdbms/platform/SQLPlatform.php: Wikimedia\Rdbms\Platform\SQLPlatform::makeList: empty input for field page_namespace - https://phabricator.wikimedia.org/T331412 [22:29:39] tgr:
I will hold until you have finished. In case a revert is needed or whatever :) [22:29:41] don't worry [22:29:48] thanks for the update! [22:30:45] !log tgr@deploy2002 Started scap: Backport for [[gerrit:895776|Leveling up: check if the task type is registered before increasing its edit count (T331524)]], [[gerrit:895777|Leveling up: check if the task type is registered before increasing its edit count (T331524)]] [22:30:49] T331524: PHP Notice: Undefined index: link-recommendation - https://phabricator.wikimedia.org/T331524 [22:32:27] !log tgr@deploy2002 tgr: Backport for [[gerrit:895776|Leveling up: check if the task type is registered before increasing its edit count (T331524)]], [[gerrit:895777|Leveling up: check if the task type is registered before increasing its edit count (T331524)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [22:39:17] !log tgr@deploy2002 Finished scap: Backport for [[gerrit:895776|Leveling up: check if the task type is registered before increasing its edit count (T331524)]], [[gerrit:895777|Leveling up: check if the task type is registered before increasing its edit count (T331524)]] (duration: 08m 31s) [22:39:22] T331524: PHP Notice: Undefined index: link-recommendation - https://phabricator.wikimedia.org/T331524 [22:40:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T329260)', diff saved to https://phabricator.wikimedia.org/P45587 and previous config saved to /var/cache/conftool/dbconfig/20230308-224018-marostegui.json [22:40:21] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2164.codfw.wmnet with reason: Maintenance [22:40:24] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [22:40:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2164.codfw.wmnet with reason: Maintenance 
[22:40:36] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [22:40:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [22:40:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2164 (T329260)', diff saved to https://phabricator.wikimedia.org/P45588 and previous config saved to /var/cache/conftool/dbconfig/20230308-224044-marostegui.json [22:42:21] !log UTC late deploys done [22:42:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:30] hashar: done, thanks for the patience [22:42:56] (and sorry for packing so many commits into this window. They were all production breakages though.) [22:42:59] tgr: thank you for your tireless work on fixing the wikis! [22:43:31] the good thing is now west coast is all having lunch or having a post lunch nap so it is an excellent time to stop CI :] [22:43:33] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on acmechief2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [22:44:23] !log Upgrading CI Jenkins [22:44:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:07] SRE, ops-eqiad, DC-Ops, Discovery-Search (Current work): Q1:rerack elastic10[53-67] - https://phabricator.wikimedia.org/T322082 (bking) This work is complete! Thanks @Jclark-ctr and everyone else who helped. Moving to "needs reporting" on the discovery-search board...
[22:48:51] CI Jenkins updated ;) [22:54:28] looks like it is all working well [22:56:52] (CR) Ottomata: [C: +1] search-airflow: grant admin rights to analytics-search-users [puppet] - https://gerrit.wikimedia.org/r/895865 (https://phabricator.wikimedia.org/T327970) (owner: Bking) [22:58:55] SRE-tools, DNS, Infrastructure-Foundations, Traffic, Patch-For-Review: DNS repo: add Jenkins job to ensure there are no duplicates - https://phabricator.wikimedia.org/T155761 (BCornwall) @BBlack / @Vgutierrez is https://gerrit.wikimedia.org/r/c/operations/dns/+/793728 something that you're am... [22:59:01] SRE-tools, DNS, Infrastructure-Foundations, Traffic, Patch-For-Review: DNS repo: add Jenkins job to ensure there are no duplicates - https://phabricator.wikimedia.org/T155761 (BCornwall) Open→Stalled [22:59:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T329260)', diff saved to https://phabricator.wikimedia.org/P45589 and previous config saved to /var/cache/conftool/dbconfig/20230308-225943-marostegui.json [22:59:50] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [23:05:31] (PS1) Ryan Kemper: elastic: Incr per-node shard recovery thru-put cap [puppet] - https://gerrit.wikimedia.org/r/895874 (https://phabricator.wikimedia.org/T317816) [23:07:20] (CR) Bking: [C: +1] elastic: Incr per-node shard recovery thru-put cap [puppet] - https://gerrit.wikimedia.org/r/895874 (https://phabricator.wikimedia.org/T317816) (owner: Ryan Kemper) [23:07:27] (CR) Ryan Kemper: "Separately from this, we may want to later take a look at if indices.recovery.max_concurrent_file_chunks should be increased from its defa" [puppet] - https://gerrit.wikimedia.org/r/895874 (https://phabricator.wikimedia.org/T317816) (owner: Ryan Kemper) [23:10:28] (CR) Bking: [C: +2] search-airflow: grant admin rights to analytics-search-users [puppet] -
https://gerrit.wikimedia.org/r/895865 (https://phabricator.wikimedia.org/T327970) (owner: Bking) [23:11:27] (PS1) BCornwall: varnish: Change systemd units Requires to BindsTo [puppet] - https://gerrit.wikimedia.org/r/895875 (https://phabricator.wikimedia.org/T284555) [23:13:00] (CR) BCornwall: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40036/console" [puppet] - https://gerrit.wikimedia.org/r/895875 (https://phabricator.wikimedia.org/T284555) (owner: BCornwall) [23:13:18] SRE, Traffic-Icebox, VPS-project-Codesearch, serviceops, Patch-For-Review: Consider using BindsTo instead of Requires to declare dependencies between systemd unit - https://phabricator.wikimedia.org/T284555 (BCornwall) Open→In progress a:BCornwall [23:14:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P45590 and previous config saved to /var/cache/conftool/dbconfig/20230308-231449-marostegui.json [23:15:12] (CR) BCornwall: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40037/console" [puppet] - https://gerrit.wikimedia.org/r/895875 (https://phabricator.wikimedia.org/T284555) (owner: BCornwall) [23:16:25] (PS10) JHathaway: Add jaeger to aux-k8s-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/888761 (https://phabricator.wikimedia.org/T320554) [23:16:27] (PS1) BCornwall: docker-service-shim: change Requires= to BindsTo= [puppet] - https://gerrit.wikimedia.org/r/895877 (https://phabricator.wikimedia.org/T284555) [23:17:31] (PS12) EoghanGaffney: Add the aphlict config on aphlict2001.codfw [puppet] - https://gerrit.wikimedia.org/r/895240 (https://phabricator.wikimedia.org/T322369) [23:19:49] SRE, ops-eqiad, DC-Ops: Testing Out Hard Drive on Swift Server - https://phabricator.wikimedia.org/T329305 (Jclark-ctr)
@MatthewVernon Will you be available for the swap tomorrow? [23:20:46] (PS2) BCornwall: docker-service-shim: change Requires= to BindsTo= [puppet] - https://gerrit.wikimedia.org/r/895877 (https://phabricator.wikimedia.org/T284555) [23:20:48] (PS1) BCornwall: ats-mtail: Change systemd Requires= to BindsTo= [puppet] - https://gerrit.wikimedia.org/r/895878 (https://phabricator.wikimedia.org/T284555) [23:20:56] SRE, SRE-Access-Requests, SecTeam-Processed, Security: New production ssh key for sbassett - https://phabricator.wikimedia.org/T331554 (sbassett) >>! In T331554#8678099, @Dzahn wrote: > Normally these are handled by the SRE on clinic duty but since it's late in Europe and to be on the safe side I... [23:21:22] (CR) CI reject: [V: -1] Add jaeger to aux-k8s-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/888761 (https://phabricator.wikimedia.org/T320554) (owner: JHathaway) [23:25:40] SRE, Traffic-Icebox, VPS-project-Codesearch, serviceops, Patch-For-Review: Consider using BindsTo instead of Requires to declare dependencies between systemd unit - https://phabricator.wikimedia.org/T284555 (BCornwall) [23:26:28] SRE, Traffic-Icebox, VPS-project-Codesearch, serviceops, Patch-For-Review: Consider using BindsTo instead of Requires to declare dependencies between systemd unit - https://phabricator.wikimedia.org/T284555 (BCornwall) Removed `ircecho` from the list as it had `Requires=network.target`, which...
[23:28:26] PROBLEM - Disk space on grafana2001 is CRITICAL: DISK CRITICAL - free space: / 103 MB (0% inode=45%): /tmp 103 MB (0% inode=45%): /var/tmp 103 MB (0% inode=45%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops [23:29:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P45591 and previous config saved to /var/cache/conftool/dbconfig/20230308-232956-marostegui.json [23:35:48] SRE, SRE-Access-Requests, SecTeam-Processed, Security: New production ssh key for sbassett - https://phabricator.wikimedia.org/T331554 (Dzahn) Ok great, thanks for confirming that. Then I will just leave this open until tomorrow. Cheers [23:37:37] (PS1) JHathaway: aux-k8s: add dummy logs-api password [labs/private] - https://gerrit.wikimedia.org/r/895880 [23:38:18] (PS1) Zabe: noc: Publicly expose new setting files [mediawiki-config] - https://gerrit.wikimedia.org/r/895883 (https://phabricator.wikimedia.org/T308932) [23:39:08] (PS1) BCornwall: codesearch: Change systemd Requires= to BindsTo= [puppet] - https://gerrit.wikimedia.org/r/895884 (https://phabricator.wikimedia.org/T284555) [23:39:10] (PS1) BCornwall: keyholder-proxy: systemd Requires= to BindsTo= [puppet] - https://gerrit.wikimedia.org/r/895885 (https://phabricator.wikimedia.org/T284555) [23:39:12] (PS1) BCornwall: fifo-log-demux: systemd Requires= to BindsTo= [puppet] - https://gerrit.wikimedia.org/r/895886 (https://phabricator.wikimedia.org/T284555) [23:40:11] jouncebot: nowandnext [23:40:12] No deployments scheduled for the next 7 hour(s) and 19 minute(s) [23:40:12] In 7 hour(s) and 19 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230309T0700) [23:40:12] In 7 hour(s) and 19 minute(s): Primary database
switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230309T0700) [23:40:20] (CR) Zabe: [C: +2] noc: Publicly expose new setting files [mediawiki-config] - https://gerrit.wikimedia.org/r/895883 (https://phabricator.wikimedia.org/T308932) (owner: Zabe) [23:41:00] (CR) JHathaway: [C: +2] aux-k8s: add dummy logs-api password [labs/private] - https://gerrit.wikimedia.org/r/895880 (owner: JHathaway) [23:41:04] (Merged) jenkins-bot: noc: Publicly expose new setting files [mediawiki-config] - https://gerrit.wikimedia.org/r/895883 (https://phabricator.wikimedia.org/T308932) (owner: Zabe) [23:41:31] (CR) JHathaway: [V: +2 C: +2] aux-k8s: add dummy logs-api password [labs/private] - https://gerrit.wikimedia.org/r/895880 (owner: JHathaway) [23:42:15] !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@29f73a4]: update virtualenv entry_points to use relative paths [23:42:30] !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@29f73a4]: update virtualenv entry_points to use relative paths (duration: 00m 14s) [23:42:43] !log zabe@deploy2002 Started scap: T308932 [23:42:47] T308932: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022) - https://phabricator.wikimedia.org/T308932 [23:45:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T329260)', diff saved to https://phabricator.wikimedia.org/P45592 and previous config saved to /var/cache/conftool/dbconfig/20230308-234502-marostegui.json [23:45:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2166.codfw.wmnet with reason: Maintenance [23:45:08] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [23:45:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2166.codfw.wmnet with reason: Maintenance [23:45:34] !log marostegui@cumin1001
dbctl commit (dc=all): 'Depooling db2166 (T329260)', diff saved to https://phabricator.wikimedia.org/P45593 and previous config saved to /var/cache/conftool/dbconfig/20230308-234534-marostegui.json [23:49:59] !log zabe@deploy2002 Finished scap: T308932 (duration: 07m 15s) [23:50:03] T308932: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022) - https://phabricator.wikimedia.org/T308932 [23:52:50] SRE, Traffic, HTTPS: Enable HSTS on store.wikimedia.org for HTTPS - https://phabricator.wikimedia.org/T128559 (BCornwall) [23:54:37] Thanks zabe [23:55:14] yw