[00:03:52] PROBLEM - Check systemd state on krb1001 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:06:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T329260)', diff saved to https://phabricator.wikimedia.org/P45594 and previous config saved to /var/cache/conftool/dbconfig/20230309-000651-marostegui.json [00:06:57] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [00:10:04] RECOVERY - Disk space on grafana2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops [00:11:36] (03CR) 10Dzahn: "Is there a problem if hound is running even though hound_proxy is not running? I don't really know much about it and to test this change o" [puppet] - 10https://gerrit.wikimedia.org/r/895884 (https://phabricator.wikimedia.org/T284555) (owner: 10BCornwall) [00:21:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P45596 and previous config saved to /var/cache/conftool/dbconfig/20230309-002157-marostegui.json [00:24:51] (03CR) 10Dzahn: peopleweb: ensure each user automatically gets a public_html dir (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/894744 (https://phabricator.wikimedia.org/T330091) (owner: 10Dzahn) [00:26:08] (03PS3) 10Dzahn: peopleweb: ensure each user automatically gets a public_html dir [puppet] - 10https://gerrit.wikimedia.org/r/894744 (https://phabricator.wikimedia.org/T330091) [00:37:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P45597 and previous config saved to /var/cache/conftool/dbconfig/20230309-003703-marostegui.json [00:52:10] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T329260)', diff saved to https://phabricator.wikimedia.org/P45598 and previous config saved to /var/cache/conftool/dbconfig/20230309-005210-marostegui.json [00:52:12] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2167.codfw.wmnet with reason: Maintenance [00:52:14] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2167.codfw.wmnet with reason: Maintenance [00:52:16] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [00:52:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2167:3318 (T329260)', diff saved to https://phabricator.wikimedia.org/P45599 and previous config saved to /var/cache/conftool/dbconfig/20230309-005220-marostegui.json [00:59:31] (03PS1) 10BryanDavis: striker: Bump container version to 2023-03-09-005633-production [puppet] - 10https://gerrit.wikimedia.org/r/895892 (https://phabricator.wikimedia.org/T330421) [01:02:57] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for David Martin - https://phabricator.wikimedia.org/T331500 (10DMartin-WMF) @MatthewVernon - Per my 1:1 discussion with @dr0ptp4kt earlier today, it would be good for me to have kerberos access. Apologies for not mentioning that... [01:03:16] (03CR) 10BryanDavis: [C: 03+1] "PCC results: https://puppet-compiler.wmflabs.org/output/895892/40040/" [puppet] - 10https://gerrit.wikimedia.org/r/895892 (https://phabricator.wikimedia.org/T330421) (owner: 10BryanDavis) [01:05:45] (03CR) 10BryanDavis: [C: 03+1] "This should be safe to merge and deploy whenever you have time on Thursday andrewbogott. There is a manual step I will need to do after th" [puppet] - 10https://gerrit.wikimedia.org/r/895892 (https://phabricator.wikimedia.org/T330421) (owner: 10BryanDavis) [01:09:23] (03CR) 10Cwhite: [C: 03+1] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/895144 (owner: 10Muehlenhoff) [01:09:55] (03CR) 10Cwhite: [C: 03+1] alertmanager: highlight 'source' label [puppet] - 10https://gerrit.wikimedia.org/r/895713 (owner: 10Filippo Giunchedi) [01:12:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318 (T329260)', diff saved to https://phabricator.wikimedia.org/P45600 and previous config saved to /var/cache/conftool/dbconfig/20230309-011251-marostegui.json [01:12:57] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [01:13:47] (03CR) 10Cwhite: "SGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/895719 (owner: 10Filippo Giunchedi) [01:18:25] !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@558da74]: correct eventgate datacenter partitioning in sensors [01:18:39] !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@558da74]: correct eventgate datacenter partitioning in sensors (duration: 00m 13s) [01:27:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318', diff saved to https://phabricator.wikimedia.org/P45601 and previous config saved to /var/cache/conftool/dbconfig/20230309-012757-marostegui.json [01:34:17] (03CR) 10Ssingh: [C: 03+1] fifo-log-demux: systemd Requires= to BindsTo= [puppet] - 10https://gerrit.wikimedia.org/r/895886 (https://phabricator.wikimedia.org/T284555) (owner: 10BCornwall) [01:34:34] (03CR) 10Ssingh: [C: 03+1] ats-mtail: Change systemd Requires= to BindsTo= [puppet] - 10https://gerrit.wikimedia.org/r/895878 (https://phabricator.wikimedia.org/T284555) (owner: 10BCornwall) [01:36:36] (03CR) 10Ssingh: [C: 03+1] "Looks good! 
I think out of an abundance of caution, when we merge this during working hours even, we should disable Puppet on A:cp and the" [puppet] - 10https://gerrit.wikimedia.org/r/895875 (https://phabricator.wikimedia.org/T284555) (owner: 10BCornwall) [01:39:56] (03CR) 10Ssingh: [C: 03+1] "https://puppet-compiler.wmflabs.org/output/895875/40041/" [puppet] - 10https://gerrit.wikimedia.org/r/895875 (https://phabricator.wikimedia.org/T284555) (owner: 10BCornwall) [01:43:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318', diff saved to https://phabricator.wikimedia.org/P45602 and previous config saved to /var/cache/conftool/dbconfig/20230309-014303-marostegui.json [01:58:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318 (T329260)', diff saved to https://phabricator.wikimedia.org/P45603 and previous config saved to /var/cache/conftool/dbconfig/20230309-015810-marostegui.json [01:58:12] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2168.codfw.wmnet with reason: Maintenance [01:58:16] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [01:58:25] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2168.codfw.wmnet with reason: Maintenance [01:58:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2168:3318 (T329260)', diff saved to https://phabricator.wikimedia.org/P45604 and previous config saved to /var/cache/conftool/dbconfig/20230309-015831-marostegui.json [02:04:48] (03PS1) 10Ssingh: dns::auth: deprecate role and update site.pp [puppet] - 10https://gerrit.wikimedia.org/r/895894 (https://phabricator.wikimedia.org/T330670) [02:05:46] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40042/console" [puppet] - 10https://gerrit.wikimedia.org/r/895894 
(https://phabricator.wikimedia.org/T330670) (owner: 10Ssingh) [02:06:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:07:38] (03CR) 10Ssingh: [V: 03+1] "DO NOT MERGE until after authdns[12]001 deprecation." [puppet] - 10https://gerrit.wikimedia.org/r/895894 (https://phabricator.wikimedia.org/T330670) (owner: 10Ssingh) [02:13:57] 10SRE, 10API Platform, 10GrowthExperiments-ImpactModule, 10Growth-Team (Current Sprint), 10MW-1.40-notes (1.40.0-wmf.21; 2023-01-30): UserImpact: Fetch information for more articles when calculating most-viewed-articles data point - https://phabricator.wikimedia.org/T324675 (10Etonkovidova) 05Open→03R... [02:19:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318 (T329260)', diff saved to https://phabricator.wikimedia.org/P45606 and previous config saved to /var/cache/conftool/dbconfig/20230309-021905-marostegui.json [02:19:11] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [02:26:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:34:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318', diff saved to https://phabricator.wikimedia.org/P45607 and previous config saved to /var/cache/conftool/dbconfig/20230309-023411-marostegui.json [02:43:33] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on acmechief2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [02:49:18] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318', diff saved to https://phabricator.wikimedia.org/P45608 and previous config saved to /var/cache/conftool/dbconfig/20230309-024917-marostegui.json [02:59:44] !log run keyholder arm on acmechief2001 [02:59:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:03:33] (KeyholderUnarmed) resolved: 1 unarmed Keyholder key(s) on acmechief2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [03:04:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318 (T329260)', diff saved to https://phabricator.wikimedia.org/P45609 and previous config saved to /var/cache/conftool/dbconfig/20230309-030424-marostegui.json [03:04:26] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2181.codfw.wmnet with reason: Maintenance [03:04:30] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [03:04:39] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2181.codfw.wmnet with reason: Maintenance [03:04:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2181 (T329260)', diff saved to https://phabricator.wikimedia.org/P45610 and previous config saved to /var/cache/conftool/dbconfig/20230309-030445-marostegui.json [03:19:09] (03PS3) 10Andrea Denisse: rsyslog: Remove centrallog1001 as TLS rsyslog destination [puppet] - 10https://gerrit.wikimedia.org/r/890884 (https://phabricator.wikimedia.org/T328803) [03:20:44] (03CR) 10Andrea Denisse: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40043/console" [puppet] - 10https://gerrit.wikimedia.org/r/890884 (https://phabricator.wikimedia.org/T328803) (owner: 10Andrea Denisse) [03:21:51] (03CR) 10Andrea Denisse: [V: 03+1] "PCC results: 
https://puppet-compiler.wmflabs.org/output/890884/40043/" [puppet] - 10https://gerrit.wikimedia.org/r/890884 (https://phabricator.wikimedia.org/T328803) (owner: 10Andrea Denisse) [03:24:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T329260)', diff saved to https://phabricator.wikimedia.org/P45611 and previous config saved to /var/cache/conftool/dbconfig/20230309-032406-marostegui.json [03:24:12] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [03:29:58] (03PS1) 10Andrea Denisse: centrallog: Remove centrallog1002 from the kafka-jumbo allow list [puppet] - 10https://gerrit.wikimedia.org/r/895898 (https://phabricator.wikimedia.org/T328803) [03:34:20] (03CR) 10Andrea Denisse: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40044/console" [puppet] - 10https://gerrit.wikimedia.org/r/895898 (https://phabricator.wikimedia.org/T328803) (owner: 10Andrea Denisse) [03:35:30] (03CR) 10Andrea Denisse: [V: 03+1 C: 03+1] "PCC results: https://puppet-compiler.wmflabs.org/output/895898/40044/" [puppet] - 10https://gerrit.wikimedia.org/r/895898 (https://phabricator.wikimedia.org/T328803) (owner: 10Andrea Denisse) [03:39:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P45612 and previous config saved to /var/cache/conftool/dbconfig/20230309-033912-marostegui.json [03:42:20] (03PS1) 10Andrea Denisse: centrallog: Add centrallog1002 as the kafkatee active host [puppet] - 10https://gerrit.wikimedia.org/r/895902 (https://phabricator.wikimedia.org/T328803) [03:43:31] (03CR) 10Andrea Denisse: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40045/console" [puppet] - 10https://gerrit.wikimedia.org/r/895902 (https://phabricator.wikimedia.org/T328803) (owner: 10Andrea Denisse) [03:44:26] 
(03CR) 10Andrea Denisse: [V: 03+1] "PCC results: https://puppet-compiler.wmflabs.org/output/895902/40045/" [puppet] - 10https://gerrit.wikimedia.org/r/895902 (https://phabricator.wikimedia.org/T328803) (owner: 10Andrea Denisse) [03:48:44] PROBLEM - dump of m2 in eqiad on backupmon1001 is CRITICAL: dump for m2 at eqiad (db1117) taken more than a week ago: Most recent backup 2023-02-28 03:17:30 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [03:54:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P45613 and previous config saved to /var/cache/conftool/dbconfig/20230309-035418-marostegui.json [04:09:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T329260)', diff saved to https://phabricator.wikimedia.org/P45614 and previous config saved to /var/cache/conftool/dbconfig/20230309-040925-marostegui.json [04:09:31] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [04:30:36] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:27:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2112.codfw.wmnet with reason: Maintenance [06:27:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2112.codfw.wmnet with reason: Maintenance [06:30:15] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2140.codfw.wmnet with reason: Maintenance [06:30:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2140.codfw.wmnet with reason: Maintenance [06:33:20] PROBLEM - Check systemd state on arclamp2001 is CRITICAL: CRITICAL - degraded: The following 
units failed: wmf_auto_restart_apache2-htcacheclean.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:40:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 14 hosts with reason: Schema change [06:40:34] !log Deploy schema change on s6 eqiad dbmaint T329684 [06:40:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 14 hosts with reason: Schema change [06:40:54] T329684: Drop default value from cuc_actor and cuc_comment_id on wmf wikis - https://phabricator.wikimedia.org/T329684 [06:42:37] !log Deploy schema change on s5 eqiad dbmaint T329684 [06:42:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:43:20] !log Deploy schema change on s2 eqiad dbmaint T329684 [06:43:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:45:19] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2109.codfw.wmnet with reason: Maintenance [06:45:33] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2109.codfw.wmnet with reason: Maintenance [06:45:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2109 (T329203)', diff saved to https://phabricator.wikimedia.org/P45615 and previous config saved to /var/cache/conftool/dbconfig/20230309-064538-marostegui.json [06:45:47] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [06:46:21] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2165.codfw.wmnet with reason: Maintenance [06:46:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2165.codfw.wmnet with reason: Maintenance [06:48:01] !log Deploy schema change on s4 eqiad dbmaint T329684 
[06:48:03] !log Deploy schema change on s1 eqiad dbmaint T329684 [06:48:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:48:11] T329684: Drop default value from cuc_actor and cuc_comment_id on wmf wikis - https://phabricator.wikimedia.org/T329684 [06:48:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:49:21] (03PS1) 10Kosta Harlan: User impact: Work around MariaDB query planner bug [extensions/GrowthExperiments] (wmf/1.40.0-wmf.26) - 10https://gerrit.wikimedia.org/r/895781 (https://phabricator.wikimedia.org/T331264) [06:49:47] (03PS1) 10Kosta Harlan: User impact: Work around MariaDB query planner bug [extensions/GrowthExperiments] (wmf/1.40.0-wmf.25) - 10https://gerrit.wikimedia.org/r/895782 (https://phabricator.wikimedia.org/T331264) [06:58:46] (03PS1) 10KartikMistry: Update cxserver to 2023-03-09-061555-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/895904 (https://phabricator.wikimedia.org/T331097) [07:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230309T0700) [07:00:04] kormat, marostegui, and Amir1: May I have your attention please! Primary database switchover. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230309T0700) [07:02:15] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [07:02:17] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [07:02:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T329684)', diff saved to https://phabricator.wikimedia.org/P45616 and previous config saved to /var/cache/conftool/dbconfig/20230309-070223-marostegui.json [07:02:31] T329684: Drop default value from cuc_actor and cuc_comment_id on wmf wikis - https://phabricator.wikimedia.org/T329684 [07:03:08] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2117.codfw.wmnet with reason: Maintenance [07:03:21] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2117.codfw.wmnet with reason: Maintenance [07:03:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2117 (T329684)', diff saved to https://phabricator.wikimedia.org/P45617 and previous config saved to /var/cache/conftool/dbconfig/20230309-070327-marostegui.json [07:06:06] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2117.codfw.wmnet with reason: Maintenance [07:06:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2117.codfw.wmnet with reason: Maintenance [07:06:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [07:06:52] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [07:06:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T329684)', diff saved to https://phabricator.wikimedia.org/P45618 and previous config saved to 
/var/cache/conftool/dbconfig/20230309-070658-marostegui.json [07:07:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2117 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P45619 and previous config saved to /var/cache/conftool/dbconfig/20230309-070733-root.json [07:08:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T329684)', diff saved to https://phabricator.wikimedia.org/P45620 and previous config saved to /var/cache/conftool/dbconfig/20230309-070805-marostegui.json [07:08:15] T329684: Drop default value from cuc_actor and cuc_comment_id on wmf wikis - https://phabricator.wikimedia.org/T329684 [07:09:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2099.codfw.wmnet with reason: Maintenance [07:10:04] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2099.codfw.wmnet with reason: Maintenance [07:10:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1113:3316', diff saved to https://phabricator.wikimedia.org/P45621 and previous config saved to /var/cache/conftool/dbconfig/20230309-071029-root.json [07:11:44] (03PS1) 10Marostegui: change_cuc_actor_T329684.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/895906 (https://phabricator.wikimedia.org/T329684) [07:12:39] 10SRE, 10serviceops: mw2420-mw2451 service implementation tracking - https://phabricator.wikimedia.org/T326363 (10akosiaris) [07:13:13] !log Deploy schema change on s8 eqiad dbmaint T329684 [07:13:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:19] T329684: Drop default value from cuc_actor and cuc_comment_id on wmf wikis - https://phabricator.wikimedia.org/T329684 [07:13:56] !log Deploy schema change on s7 eqiad dbmaint T329684 [07:14:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:14:19] !log marostegui@cumin1001 START - 
Cookbook sre.hosts.downtime for 12:00:00 on 15 hosts with reason: Schema change [07:14:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 15 hosts with reason: Schema change [07:15:13] !log Deploy schema change on s3 eqiad dbmaint T329684 [07:15:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2106.codfw.wmnet with reason: Maintenance [07:18:03] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2106.codfw.wmnet with reason: Maintenance [07:18:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2106 (T329260)', diff saved to https://phabricator.wikimedia.org/P45622 and previous config saved to /var/cache/conftool/dbconfig/20230309-071809-marostegui.json [07:18:13] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2097.codfw.wmnet with reason: Maintenance [07:18:18] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [07:18:26] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2097.codfw.wmnet with reason: Maintenance [07:18:34] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance [07:18:48] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance [07:18:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2104 (T329684)', diff saved to https://phabricator.wikimedia.org/P45623 and previous config saved to /var/cache/conftool/dbconfig/20230309-071853-marostegui.json [07:19:00] T329684: Drop default value from cuc_actor and cuc_comment_id on wmf wikis - https://phabricator.wikimedia.org/T329684 [07:20:41] !log marostegui@cumin1001 dbctl commit (dc=all): 
'db2104 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P45624 and previous config saved to /var/cache/conftool/dbconfig/20230309-072040-root.json [07:22:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2117 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P45625 and previous config saved to /var/cache/conftool/dbconfig/20230309-072238-root.json [07:23:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T329203)', diff saved to https://phabricator.wikimedia.org/P45626 and previous config saved to /var/cache/conftool/dbconfig/20230309-072319-marostegui.json [07:23:25] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [07:23:50] (03PS2) 10Marostegui: change_cuc_actor_T329684.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/895906 (https://phabricator.wikimedia.org/T329684) [07:31:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106 (T329260)', diff saved to https://phabricator.wikimedia.org/P45627 and previous config saved to /var/cache/conftool/dbconfig/20230309-073127-marostegui.json [07:31:38] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [07:34:58] (03PS1) 10Marostegui: m5-proxies: Add db1176 for testing [puppet] - 10https://gerrit.wikimedia.org/r/895908 (https://phabricator.wikimedia.org/T330847) [07:35:44] (03CR) 10Marostegui: [C: 03+2] m5-proxies: Add db1176 for testing [puppet] - 10https://gerrit.wikimedia.org/r/895908 (https://phabricator.wikimedia.org/T330847) (owner: 10Marostegui) [07:35:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2104 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P45628 and previous config saved to /var/cache/conftool/dbconfig/20230309-073545-root.json [07:37:43] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'db2117 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P45629 and previous config saved to /var/cache/conftool/dbconfig/20230309-073743-root.json [07:38:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P45630 and previous config saved to /var/cache/conftool/dbconfig/20230309-073825-marostegui.json [07:39:17] 10SRE, 10DBA, 10Toolhub, 10Wikimedia-Mailing-lists, and 2 others: Switchover m5 master (db1183 -> db1176) - https://phabricator.wikimedia.org/T330847 (10Marostegui) Checked that haproxy sees db1176 just fine [07:39:22] (03PS1) 10Marostegui: Revert "m5-proxies: Add db1176 for testing" [puppet] - 10https://gerrit.wikimedia.org/r/895783 [07:40:00] (03CR) 10Marostegui: [C: 03+2] Revert "m5-proxies: Add db1176 for testing" [puppet] - 10https://gerrit.wikimedia.org/r/895783 (owner: 10Marostegui) [07:40:20] folks, I'm not feeling well enough to run the deployment window, I see no patches scheduled at this time. If someone sneaks one in at the last minute, Amir1 or jnuche, I hope one of you will be available. (Also no trainees signed up today either so no worries there.) 
[07:41:05] Amir.1 is on vacation [07:44:19] (03PS1) 10Marostegui: mariadb: Promote db1176 to m5 master [puppet] - 10https://gerrit.wikimedia.org/r/895910 (https://phabricator.wikimedia.org/T330847) [07:44:31] 10SRE, 10serviceops: mw2420-mw2451 service implementation tracking - https://phabricator.wikimedia.org/T326363 (10akosiaris) [07:44:41] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover time" [puppet] - 10https://gerrit.wikimedia.org/r/895910 (https://phabricator.wikimedia.org/T330847) (owner: 10Marostegui) [07:45:37] 10SRE, 10DBA, 10Toolhub, 10Wikimedia-Mailing-lists, and 2 others: Switchover m5 master (db1183 -> db1176) - https://phabricator.wikimedia.org/T330847 (10Marostegui) [07:46:02] 10SRE, 10DBA, 10Toolhub, 10Wikimedia-Mailing-lists, and 2 others: Switchover m5 master (db1183 -> db1176) - https://phabricator.wikimedia.org/T330847 (10Marostegui) [07:46:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106', diff saved to https://phabricator.wikimedia.org/P45631 and previous config saved to /var/cache/conftool/dbconfig/20230309-074633-marostegui.json [07:47:18] (03PS1) 10Elukey: profile::calico::kubernetes: set new istio-cni defaults [puppet] - 10https://gerrit.wikimedia.org/r/895911 (https://phabricator.wikimedia.org/T328291) [07:48:26] 10SRE, 10DBA, 10Toolhub, 10Wikimedia-Mailing-lists, and 2 others: Switchover m5 master (db1183 -> db1176) - https://phabricator.wikimedia.org/T330847 (10Marostegui) [07:49:03] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40046/console" [puppet] - 10https://gerrit.wikimedia.org/r/895911 (https://phabricator.wikimedia.org/T328291) (owner: 10Elukey) [07:50:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2104 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P45632 and previous config saved to /var/cache/conftool/dbconfig/20230309-075050-root.json [07:51:29] 10SRE, 
10serviceops: mw2420-mw2451 service implementation tracking - https://phabricator.wikimedia.org/T326363 (10akosiaris) [07:52:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2117 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P45633 and previous config saved to /var/cache/conftool/dbconfig/20230309-075247-root.json [07:53:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P45634 and previous config saved to /var/cache/conftool/dbconfig/20230309-075331-marostegui.json [07:57:26] 10SRE, 10serviceops: mw2420-mw2451 service implementation tracking - https://phabricator.wikimedia.org/T326363 (10akosiaris) p:05Triage→03Medium While I did provide data on specific racks, given our availability zones are centered around rows right now, I am gonna focus on rows. Looking at the data I note... [08:00:05] Amir1, apergos, and jnuche: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230309T0800). 
[08:01:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106', diff saved to https://phabricator.wikimedia.org/P45635 and previous config saved to /var/cache/conftool/dbconfig/20230309-080140-marostegui.json [08:05:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2104 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P45636 and previous config saved to /var/cache/conftool/dbconfig/20230309-080555-root.json [08:07:13] hi, I have a patch to add to the window [08:07:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2117 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P45637 and previous config saved to /var/cache/conftool/dbconfig/20230309-080752-root.json [08:08:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T329203)', diff saved to https://phabricator.wikimedia.org/P45638 and previous config saved to /var/cache/conftool/dbconfig/20230309-080837-marostegui.json [08:08:39] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2127.codfw.wmnet with reason: Maintenance [08:08:47] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [08:08:49] I can deploy [08:08:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2127.codfw.wmnet with reason: Maintenance [08:08:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2127 (T329203)', diff saved to https://phabricator.wikimedia.org/P45639 and previous config saved to /var/cache/conftool/dbconfig/20230309-080858-marostegui.json [08:09:04] (03CR) 10Muehlenhoff: [C: 03+2] logstash: Stop apache2-htcacheclean.service via Puppet [puppet] - 10https://gerrit.wikimedia.org/r/895144 (owner: 10Muehlenhoff) [08:09:58] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" 
[extensions/GrowthExperiments] (wmf/1.40.0-wmf.26) - 10https://gerrit.wikimedia.org/r/895781 (https://phabricator.wikimedia.org/T331264) (owner: 10Kosta Harlan) [08:10:04] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.25) - 10https://gerrit.wikimedia.org/r/895782 (https://phabricator.wikimedia.org/T331264) (owner: 10Kosta Harlan) [08:10:11] (03PS4) 10Giuseppe Lavagetto: Add check_dns_state to service.Service [software/spicerack] - 10https://gerrit.wikimedia.org/r/894655 [08:10:17] thanks taavi [08:10:28] (03CR) 10DCausse: [C: 03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/895874 (https://phabricator.wikimedia.org/T317816) (owner: 10Ryan Kemper) [08:13:12] (03PS1) 10Muehlenhoff: Fix service name [puppet] - 10https://gerrit.wikimedia.org/r/896006 [08:13:50] (03CR) 10CI reject: [V: 04-1] Add check_dns_state to service.Service [software/spicerack] - 10https://gerrit.wikimedia.org/r/894655 (owner: 10Giuseppe Lavagetto) [08:16:20] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS1299/IPv6: Connect - Telia, AS1299/IPv4: Connect - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:16:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106 (T329260)', diff saved to https://phabricator.wikimedia.org/P45640 and previous config saved to /var/cache/conftool/dbconfig/20230309-081646-marostegui.json [08:16:48] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2110.codfw.wmnet with reason: Maintenance [08:16:53] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [08:17:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2110.codfw.wmnet with reason: Maintenance [08:17:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2110 (T329260)', diff saved to 
https://phabricator.wikimedia.org/P45641 and previous config saved to /var/cache/conftool/dbconfig/20230309-081707-marostegui.json [08:17:38] (03CR) 10Muehlenhoff: [C: 03+2] Fix service name [puppet] - 10https://gerrit.wikimedia.org/r/896006 (owner: 10Muehlenhoff) [08:18:12] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 91, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:21:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2104 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P45642 and previous config saved to /var/cache/conftool/dbconfig/20230309-082059-root.json [08:22:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2117 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P45643 and previous config saved to /var/cache/conftool/dbconfig/20230309-082257-root.json [08:23:54] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on ganeti1011.eqiad.wmnet with reason: remove from cluster for reimage [08:23:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on ganeti1011.eqiad.wmnet with reason: remove from cluster for reimage [08:24:03] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=488c31ea-afbd-425c-93db-bb4f4daa8146) set by jmm@cumin2002 for 2 days, 0:00:00 on 1 host(s) and their services with r... 
[08:27:15] (03PS5) 10Giuseppe Lavagetto: Add check_dns_state to service.Service [software/spicerack] - 10https://gerrit.wikimedia.org/r/894655 [08:27:44] (03Merged) 10jenkins-bot: User impact: Work around MariaDB query planner bug [extensions/GrowthExperiments] (wmf/1.40.0-wmf.26) - 10https://gerrit.wikimedia.org/r/895781 (https://phabricator.wikimedia.org/T331264) (owner: 10Kosta Harlan) [08:27:48] (03Merged) 10jenkins-bot: User impact: Work around MariaDB query planner bug [extensions/GrowthExperiments] (wmf/1.40.0-wmf.25) - 10https://gerrit.wikimedia.org/r/895782 (https://phabricator.wikimedia.org/T331264) (owner: 10Kosta Harlan) [08:27:50] (03CR) 10Giuseppe Lavagetto: Add check_dns_state to service.Service (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/894655 (owner: 10Giuseppe Lavagetto) [08:28:15] !log taavi@deploy2002 Started scap: Backport for [[gerrit:895781|User impact: Work around MariaDB query planner bug (T331264)]], [[gerrit:895782|User impact: Work around MariaDB query planner bug (T331264)]] [08:28:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110 (T329260)', diff saved to https://phabricator.wikimedia.org/P45644 and previous config saved to /var/cache/conftool/dbconfig/20230309-082820-marostegui.json [08:28:22] T331264: Error 2006 from GrowthExperiments\UserImpact\ComputedUserImpactLookup::getEditData - https://phabricator.wikimedia.org/T331264 [08:28:27] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [08:30:11] (03CR) 10Nicolas Fraison: Specify docker image and version consistently (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/895842 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [08:30:39] !log taavi@deploy2002 taavi and kharlan: Backport for [[gerrit:895781|User impact: Work around MariaDB query planner bug (T331264)]], [[gerrit:895782|User impact: Work around MariaDB query planner bug 
(T331264)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [08:31:13] kostajh: please test [08:31:28] (03CR) 10CI reject: [V: 04-1] Add check_dns_state to service.Service [software/spicerack] - 10https://gerrit.wikimedia.org/r/894655 (owner: 10Giuseppe Lavagetto) [08:31:41] taavi: ack. both wmf.25 and wmf.26? [08:31:54] yes [08:33:17] !log remove ganeti1011 for eventual reimage T311687 [08:33:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:21] T311687: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 [08:33:43] taavi: lgtm! [08:33:59] thanks, syncing [08:34:06] kostajh: I am going to monitor a bit the errors and see if they get gone :) [08:35:22] thanks, both [08:36:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2104 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P45645 and previous config saved to /var/cache/conftool/dbconfig/20230309-083604-root.json [08:38:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2117 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P45646 and previous config saved to /var/cache/conftool/dbconfig/20230309-083802-root.json [08:39:52] !log taavi@deploy2002 Finished scap: Backport for [[gerrit:895781|User impact: Work around MariaDB query planner bug (T331264)]], [[gerrit:895782|User impact: Work around MariaDB query planner bug (T331264)]] (duration: 11m 37s) [08:39:57] T331264: Error 2006 from GrowthExperiments\UserImpact\ComputedUserImpactLookup::getEditData - https://phabricator.wikimedia.org/T331264 [08:40:01] all done [08:40:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2141.codfw.wmnet with reason: Maintenance [08:40:18] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2141.codfw.wmnet with reason: Maintenance [08:41:43] 10SRE,
10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10cmooney) [08:42:03] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10cmooney) a:03cmooney [08:42:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2141.codfw.wmnet with reason: Maintenance [08:42:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2141.codfw.wmnet with reason: Maintenance [08:43:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110', diff saved to https://phabricator.wikimedia.org/P45647 and previous config saved to /var/cache/conftool/dbconfig/20230309-084326-marostegui.json [08:43:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance [08:43:54] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance [08:44:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2180 (T329684)', diff saved to https://phabricator.wikimedia.org/P45648 and previous config saved to /var/cache/conftool/dbconfig/20230309-084359-marostegui.json [08:44:08] T329684: Drop default value from cuc_actor and cuc_comment_id on wmf wikis - https://phabricator.wikimedia.org/T329684 [08:44:11] hi, I was AFK, sorry [08:44:19] taavi: thanks for taking care of the deployment [08:45:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P45649 and previous config saved to /var/cache/conftool/dbconfig/20230309-084543-root.json [08:46:01] !log marostegui@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance db2127 (T329203)', diff saved to https://phabricator.wikimedia.org/P45650 and previous config saved to /var/cache/conftool/dbconfig/20230309-084601-marostegui.json [08:46:07] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [08:46:39] thanks taavi [08:51:52] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1011.eqiad.wmnet with OS bullseye [08:51:59] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1011.eqiad.wmnet with OS bullseye [08:52:53] 10SRE, 10serviceops: mw2420-mw2451 service implementation tracking - https://phabricator.wikimedia.org/T326363 (10akosiaris) Playing around with data using the following constraints: * We are 40%+ skewed towards using row A across all mw2* hosts (this isn't easily fixable right now) * I can only easily mess a... 
[08:54:18] !log Deploy schema change on s6 codfw dbmaint T329684 [08:54:20] !log Deploy schema change on s5 codfw dbmaint T329684 [08:54:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:22] !log Deploy schema change on s2 codfw dbmaint T329684 [08:54:24] T329684: Drop default value from cuc_actor and cuc_comment_id on wmf wikis - https://phabricator.wikimedia.org/T329684 [08:54:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110', diff saved to https://phabricator.wikimedia.org/P45651 and previous config saved to /var/cache/conftool/dbconfig/20230309-085832-marostegui.json [08:59:16] (03PS7) 10Jelto: gitlab_runner: add optional docker registry proxy to runners [puppet] - 10https://gerrit.wikimedia.org/r/894100 (https://phabricator.wikimedia.org/T329679) [09:00:04] jeena and jnuche: gettimeofday() says it's time for MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot). 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230309T0900) [09:00:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P45652 and previous config saved to /var/cache/conftool/dbconfig/20230309-090048-root.json [09:01:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P45653 and previous config saved to /var/cache/conftool/dbconfig/20230309-090107-marostegui.json [09:06:26] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1011.eqiad.wmnet with reason: host reimage [09:09:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1011.eqiad.wmnet with reason: host reimage [09:11:01] (03PS6) 10Giuseppe Lavagetto: Add check_dns_state to service.Service [software/spicerack] - 10https://gerrit.wikimedia.org/r/894655 [09:11:24] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40047/console" [puppet] - 10https://gerrit.wikimedia.org/r/894100 (https://phabricator.wikimedia.org/T329679) (owner: 10Jelto) [09:12:17] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on 10 hosts with reason: cr1-codfw linecard 1/0 reset [09:12:18] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 0:30:00 on 10 hosts with reason: cr1-codfw linecard 1/0 reset [09:13:25] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on 6 hosts with reason: cr1-codfw linecard 1/0 reset [09:13:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110 (T329260)', diff saved to https://phabricator.wikimedia.org/P45654 and previous config saved to /var/cache/conftool/dbconfig/20230309-091338-marostegui.json [09:13:41] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 
12:00:00 on db2119.codfw.wmnet with reason: Maintenance [09:13:42] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 6 hosts with reason: cr1-codfw linecard 1/0 reset [09:13:44] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [09:13:54] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2119.codfw.wmnet with reason: Maintenance [09:14:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2119 (T329260)', diff saved to https://phabricator.wikimedia.org/P45655 and previous config saved to /var/cache/conftool/dbconfig/20230309-091400-marostegui.json [09:14:09] 10SRE, 10serviceops: mw2420-mw2451 service implementation tracking - https://phabricator.wikimedia.org/T326363 (10akosiaris) [09:15:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P45656 and previous config saved to /var/cache/conftool/dbconfig/20230309-091552-root.json [09:16:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P45657 and previous config saved to /var/cache/conftool/dbconfig/20230309-091613-marostegui.json [09:17:14] (03CR) 10Filippo Giunchedi: [C: 03+1] "Not an expert but LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/895877 (https://phabricator.wikimedia.org/T284555) (owner: 10BCornwall) [09:17:23] (03CR) 10Filippo Giunchedi: [C: 03+1] rsyslog: Remove centrallog1001 as TLS rsyslog destination [puppet] - 10https://gerrit.wikimedia.org/r/890884 (https://phabricator.wikimedia.org/T328803) (owner: 10Andrea Denisse) [09:17:42] (03PS1) 10Elukey: ml-services: update docker images to roll out a fix for rev-id matching [deployment-charts] - 10https://gerrit.wikimedia.org/r/896022 [09:17:44] (03CR) 10Filippo Giunchedi: centrallog: Remove centrallog1002 from the kafka-jumbo 
allow list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/895898 (https://phabricator.wikimedia.org/T328803) (owner: 10Andrea Denisse) [09:17:58] (03CR) 10Filippo Giunchedi: [C: 03+1] centrallog: Add centrallog1002 as the kafkatee active host [puppet] - 10https://gerrit.wikimedia.org/r/895902 (https://phabricator.wikimedia.org/T328803) (owner: 10Andrea Denisse) [09:18:21] (03CR) 10Filippo Giunchedi: [C: 03+2] dispatch/grafana: retry GETs too on LDAP sync [puppet] - 10https://gerrit.wikimedia.org/r/895719 (owner: 10Filippo Giunchedi) [09:18:45] (03PS8) 10Jelto: gitlab_runner: add optional docker registry proxy to runners [puppet] - 10https://gerrit.wikimedia.org/r/894100 (https://phabricator.wikimedia.org/T329679) [09:18:51] (03CR) 10Filippo Giunchedi: [C: 03+2] alertmanager: highlight 'source' label [puppet] - 10https://gerrit.wikimedia.org/r/895713 (owner: 10Filippo Giunchedi) [09:19:18] (03CR) 10Ilias Sarantopoulos: [C: 03+1] ml-services: update docker images to roll out a fix for rev-id matching [deployment-charts] - 10https://gerrit.wikimedia.org/r/896022 (owner: 10Elukey) [09:19:44] !log Deploy schema change on s7 codfw dbmaint T329684 [09:19:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:49] T329684: Drop default value from cuc_actor and cuc_comment_id on wmf wikis - https://phabricator.wikimedia.org/T329684 [09:20:07] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40048/console" [puppet] - 10https://gerrit.wikimedia.org/r/894100 (https://phabricator.wikimedia.org/T329679) (owner: 10Jelto) [09:20:17] !log jnuche@deploy2002 Installing scap version "latest" for 553 hosts [09:21:26] !log jnuche@deploy2002 Installation of scap version "latest" completed for 553 hosts [09:23:39] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM! 
thank you" [puppet] - 10https://gerrit.wikimedia.org/r/895821 (owner: 10Jbond) [09:23:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1011.eqiad.wmnet with OS bullseye [09:23:45] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1011.eqiad.wmnet with OS bullseye completed: - ganeti1011 (**PASS**) - Downtimed on... [09:25:01] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on pfw3-codfw with reason: cr1-codfw linecard 1/0 reset [09:25:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119 (T329260)', diff saved to https://phabricator.wikimedia.org/P45658 and previous config saved to /var/cache/conftool/dbconfig/20230309-092502-marostegui.json [09:25:11] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [09:25:16] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on pfw3-codfw with reason: cr1-codfw linecard 1/0 reset [09:27:45] !log disabling Transit cct on cr1-codfw xe-1/0/1:0 (T331527) [09:27:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:29] !log delete old/unused ML-related docker images from the registry - T331513 [09:29:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:34] T331513: Delete old ml-related docker images that are deprecated - https://phabricator.wikimedia.org/T331513 [09:29:51] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 142, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:30:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P45659 and previous config saved to /var/cache/conftool/dbconfig/20230309-093057-root.json [09:31:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T329203)', diff saved to https://phabricator.wikimedia.org/P45660 and previous config saved to /var/cache/conftool/dbconfig/20230309-093120-marostegui.json [09:31:22] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2139.codfw.wmnet with reason: Maintenance [09:31:25] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [09:31:35] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2139.codfw.wmnet with reason: Maintenance [09:32:04] ^^^ cr2-codfw above is part of my works, overlooked the downtime on that one [09:32:24] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on cr2-codfw,cr2-codfw IPv6 with reason: cr1-codfw linecard 1/0 reset [09:32:39] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on cr2-codfw,cr2-codfw IPv6 with reason: cr1-codfw linecard 1/0 reset [09:33:21] !log resetting Pic 1/0 on cr1-codfw [09:33:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:13] (03PS1) 10MVernon: admin: update sbassett ssh key [puppet] - 10https://gerrit.wikimedia.org/r/896024 (https://phabricator.wikimedia.org/T331554) [09:35:27] RECOVERY - Juniper alarms on cr1-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [09:35:29] (03CR) 10MVernon: "Please confirm your ssh key is correct and then +1, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/896024 (https://phabricator.wikimedia.org/T331554) (owner: 10MVernon) [09:35:59] (03CR) 10Btullis: [C: 04-1] Specify docker image and version consistently (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/895842 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [09:36:09] (03CR) 10Muehlenhoff: mod_auth_cas: add logout script for mod_auth_cas (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/695255 (owner: 10Jbond) [09:36:26] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review, 10SecTeam-Processed, 10Security: New production ssh key for sbassett - https://phabricator.wikimedia.org/T331554 (10MatthewVernon) @sbassett I've opened a CR to update your ssh key - if you can confirm it's correct and +1 the CR, I'll merge it. [09:40:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119', diff saved to https://phabricator.wikimedia.org/P45661 and previous config saved to /var/cache/conftool/dbconfig/20230309-094008-marostegui.json [09:40:33] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:40:57] (03CR) 10MVernon: [C: 03+1] "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/895910 (https://phabricator.wikimedia.org/T330847) (owner: 10Marostegui) [09:41:34] (03CR) 10Marostegui: [C: 04-2] mariadb: Promote db1176 to m5 master (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/895910 (https://phabricator.wikimedia.org/T330847) (owner: 10Marostegui) [09:46:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P45662 and previous config saved to /var/cache/conftool/dbconfig/20230309-094602-root.json [09:47:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - 
https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [09:48:37] !log cmooney@cumin1001 START - Cookbook sre.hosts.remove-downtime for 9 hosts [09:48:40] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 9 hosts [09:48:44] (03CR) 10Hashar: [C: 03+1] codesearch: Change systemd Requires= to BindsTo= [puppet] - 10https://gerrit.wikimedia.org/r/895884 (https://phabricator.wikimedia.org/T284555) (owner: 10BCornwall) [09:49:41] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1011.eqiad.wmnet [09:50:51] (03CR) 10Jelto: [V: 03+1] gitlab_runner: add optional docker registry proxy to runners (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/894100 (https://phabricator.wikimedia.org/T329679) (owner: 10Jelto) [09:52:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [09:53:07] !log Deploy schema change on s8 codfw dbmaint T329684 [09:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:12] T329684: Drop default value from cuc_actor and cuc_comment_id on wmf wikis - https://phabricator.wikimedia.org/T329684 [09:54:19] (03PS1) 10MVernon: admin: dmartin now needs analytics-privatedata + krb [puppet] - 10https://gerrit.wikimedia.org/r/896046 (https://phabricator.wikimedia.org/T331500) [09:55:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119', diff saved to https://phabricator.wikimedia.org/P45663 and previous config saved to 
/var/cache/conftool/dbconfig/20230309-095514-marostegui.json [09:55:38] !log Deploy schema change on s4 codfw dbmaint T329684 [09:55:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1011.eqiad.wmnet [09:57:35] (03PS1) 10Nicolas Fraison: hadoop-hdfs: Add alert on FSImage age [alerts] - 10https://gerrit.wikimedia.org/r/896049 (https://phabricator.wikimedia.org/T331310) [09:59:56] (03CR) 10Nicolas Fraison: Specify docker image and version consistently (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/895842 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [10:00:57] (03CR) 10Klausman: [C: 03+1] profile::calico::kubernetes: set new istio-cni defaults [puppet] - 10https://gerrit.wikimedia.org/r/895911 (https://phabricator.wikimedia.org/T328291) (owner: 10Elukey) [10:01:21] (03CR) 10Klausman: [C: 03+1] ml-services: update docker images to roll out a fix for rev-id matching [deployment-charts] - 10https://gerrit.wikimedia.org/r/896022 (owner: 10Elukey) [10:01:35] !log commencing work to drain cr2-codfw ports on card 1/0 (T331601) [10:01:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:16] (03CR) 10Nicolas Fraison: Specify docker image and version consistently (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/895842 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [10:05:41] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2149.codfw.wmnet with reason: Maintenance [10:06:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2149.codfw.wmnet with reason: Maintenance [10:06:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2149 (T329203)', diff saved to https://phabricator.wikimedia.org/P45664 and previous config saved to 
/var/cache/conftool/dbconfig/20230309-100611-marostegui.json [10:06:16] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [10:06:46] (03CR) 10Elukey: [C: 03+2] ml-services: update docker images to roll out a fix for rev-id matching [deployment-charts] - 10https://gerrit.wikimedia.org/r/896022 (owner: 10Elukey) [10:10:18] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [10:10:19] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1011.eqiad.wmnet to cluster eqiad and group C [10:10:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119 (T329260)', diff saved to https://phabricator.wikimedia.org/P45665 and previous config saved to /var/cache/conftool/dbconfig/20230309-101020-marostegui.json [10:10:23] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2136.codfw.wmnet with reason: Maintenance [10:10:25] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1011.eqiad.wmnet to cluster eqiad and group C [10:10:30] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [10:10:31] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [10:10:36] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2136.codfw.wmnet with reason: Maintenance [10:10:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2136 (T329260)', diff saved to https://phabricator.wikimedia.org/P45666 and previous config saved to /var/cache/conftool/dbconfig/20230309-101042-marostegui.json [10:10:55] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
[10:11:06] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [10:11:20] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1011.eqiad.wmnet [10:11:22] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [10:11:35] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [10:11:49] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [10:12:00] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [10:13:04] !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [10:13:09] !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [10:13:19] sorry for the spam, broad deployment of ml model servers :) [10:13:32] !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [10:13:35] 10SRE, 10Observability-Logging: Ingest webrequest sampled 1000 into logstash - https://phabricator.wikimedia.org/T301110 (10fgiunchedi) This is still valid, though nowadays the implementation will be much simpler: we can ingest `webrequest_sampled` directly from Kafka! [10:13:52] !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
[10:15:38] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for David Martin - https://phabricator.wikimedia.org/T331500 (10MatthewVernon) [10:15:57] (03CR) 10Volans: [C: 03+1] "LGTM, minor style nit inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/894655 (owner: 10Giuseppe Lavagetto) [10:16:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:17:58] jeena / jnuche o/ is train deploy clear? I'd like to deploy some no op config changes [10:19:03] ottomata: you can go ahead, train will happen today in US time [10:19:18] (03Abandoned) 10Ottomata: WIP - install pyflink deps with pip [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/883278 (https://phabricator.wikimedia.org/T327494) (owner: 10Ottomata) [10:19:25] okay, ty [10:19:28] !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [10:19:45] !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
[10:19:50] (03PS3) 10Ottomata: ext-EventStreamConfig.php - wgEventStreams lives here [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895831 (https://phabricator.wikimedia.org/T308932) [10:19:55] (03CR) 10Ottomata: [C: 03+2] ext-EventStreamConfig.php - wgEventStreams lives here [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895831 (https://phabricator.wikimedia.org/T308932) (owner: 10Ottomata) [10:20:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1011.eqiad.wmnet [10:20:42] (03Merged) 10jenkins-bot: ext-EventStreamConfig.php - wgEventStreams lives here [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895831 (https://phabricator.wikimedia.org/T308932) (owner: 10Ottomata) [10:20:54] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1011.eqiad.wmnet to cluster eqiad and group C [10:20:59] (03CR) 10JMeybohm: [C: 03+1] "cc: Jesse as aux will probably want to adapt this, although they currently don't have any tainted nodes." [deployment-charts] - 10https://gerrit.wikimedia.org/r/895748 (owner: 10Alexandros Kosiaris) [10:21:18] (03Abandoned) 10Arturo Borrero Gonzalez: toolforge: wmcs-k8s-get-cert.sh: fix inverted logic [puppet] - 10https://gerrit.wikimedia.org/r/895224 (owner: 10Arturo Borrero Gonzalez) [10:21:25] !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [10:21:32] !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
[10:21:58] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (PUT deployments) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:22:02] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on 9 hosts with reason: cr2-codfw linecard 1/0 reset [10:22:13] !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [10:22:18] !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [10:22:21] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 9 hosts with reason: cr2-codfw linecard 1/0 reset [10:22:45] !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [10:22:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T329260)', diff saved to https://phabricator.wikimedia.org/P45667 and previous config saved to /var/cache/conftool/dbconfig/20230309-102247-marostegui.json [10:22:52] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [10:23:30] !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [10:24:35] !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [10:24:55] (03CR) 10Ladsgroup: change_cuc_actor_T329684.py: New schema change (031 comment) [software/schema-changes] - 10https://gerrit.wikimedia.org/r/895906 (https://phabricator.wikimedia.org/T329684) (owner: 10Marostegui) [10:25:07] (03PS4) 10Ottomata: wgEventStreams etc. 
- Remove duplicate configs after refactor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895832 (https://phabricator.wikimedia.org/T308932) [10:25:17] !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [10:25:18] (03CR) 10Ladsgroup: "I'll be afk for most of the day, so if this is fixed, it has my virtual +1." [software/schema-changes] - 10https://gerrit.wikimedia.org/r/895906 (https://phabricator.wikimedia.org/T329684) (owner: 10Marostegui) [10:26:40] !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [10:26:55] !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [10:27:13] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (PUT deployments) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:27:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti1011.eqiad.wmnet to cluster eqiad and group C [10:28:37] (03PS1) 10Majavah: cr-cloud: permit toolsdb return traffic to cloudcontrols [homer/public] - 10https://gerrit.wikimedia.org/r/896051 (https://phabricator.wikimedia.org/T303663) [10:29:13] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:29:36] !log otto@deploy2002 Synchronized wmf-config/ext-EventStreamConfig.php: Step 1a: ext-EventStreamConfig.php - wgEventStreams lives here - T308932 (duration: 06m 43s) [10:29:41] T308932: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022) - https://phabricator.wikimedia.org/T308932 [10:30:06] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] 
"Thanks for the patch. Please hold this change until we can clarify the setup." [homer/public] - 10https://gerrit.wikimedia.org/r/896051 (https://phabricator.wikimedia.org/T303663) (owner: 10Majavah) [10:32:13] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:32:36] !log hashar@deploy2002 Started deploy [integration/docroot@095a329]: Add 'Test coverage' link for MW core and a few others [10:32:44] !log hashar@deploy2002 Finished deploy [integration/docroot@095a329]: Add 'Test coverage' link for MW core and a few others (duration: 00m 08s) [10:34:47] PROBLEM - Check unit status of httpbb_kubernetes_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:35:28] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:37:13] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:37:51] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: don't accept NEW connections wan -> virt to internal private addresses [puppet] - 10https://gerrit.wikimedia.org/r/896052 (https://phabricator.wikimedia.org/T272585) [10:37:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P45668 and previous config saved to /var/cache/conftool/dbconfig/20230309-103753-marostegui.json [10:39:36] !log otto@deploy2002 Synchronized multiversion/MWConfigCacheGenerator.php: Step 1b: 
MWConfigCacheGenerator.php - load ext-EventStreamConfig.php - T308932 (duration: 06m 23s) [10:39:41] T308932: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022) - https://phabricator.wikimedia.org/T308932 [10:39:42] (03CR) 10CI reject: [V: 04-1] cloudgw: don't accept NEW connections wan -> virt to internal private addresses [puppet] - 10https://gerrit.wikimedia.org/r/896052 (https://phabricator.wikimedia.org/T272585) (owner: 10Arturo Borrero Gonzalez) [10:40:05] (03CR) 10Ottomata: [C: 03+2] wgEventStreams etc. - Remove duplicate configs after refactor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895832 (https://phabricator.wikimedia.org/T308932) (owner: 10Ottomata) [10:40:49] (03Merged) 10jenkins-bot: wgEventStreams etc. - Remove duplicate configs after refactor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895832 (https://phabricator.wikimedia.org/T308932) (owner: 10Ottomata) [10:42:13] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (POST pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:42:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T329203)', diff saved to https://phabricator.wikimedia.org/P45669 and previous config saved to /var/cache/conftool/dbconfig/20230309-104220-marostegui.json [10:42:26] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [10:42:27] (03CR) 10JMeybohm: [C: 03+1] profile::calico::kubernetes: set new istio-cni defaults [puppet] - 10https://gerrit.wikimedia.org/r/895911 (https://phabricator.wikimedia.org/T328291) (owner: 10Elukey) [10:43:18] (03PS1) 10Nicolas Fraison: spark-operator: rely on exec entrypoint instead of shell one [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/896053 [10:44:30] (03CR) 10Elukey: [V: 03+1 C: 03+2] 
profile::calico::kubernetes: set new istio-cni defaults [puppet] - 10https://gerrit.wikimedia.org/r/895911 (https://phabricator.wikimedia.org/T328291) (owner: 10Elukey) [10:44:56] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 0:20:00 on 9 hosts with reason: cr2-codfw linecard 1/0 reset [10:45:04] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:20:00 on 9 hosts with reason: cr2-codfw linecard 1/0 reset [10:45:28] (KubernetesAPILatency) resolved: (5) High Kubernetes API latency (POST pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:47:21] !log Resetting PIC in slot 1/0 on cr2-codfw T331527 [10:47:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:49] (03PS2) 10Nicolas Fraison: spark-operator: rely on exec entrypoint instead of shell one [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/896053 (https://phabricator.wikimedia.org/T318926) [10:50:55] !log otto@deploy2002 Synchronized wmf-config/ext-EventLogging.php: Step 2a: ext-EventLogging.php - remove duplicate configs - T308932 (duration: 06m 32s) [10:50:59] T308932: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022) - https://phabricator.wikimedia.org/T308932 [10:52:13] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:53:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P45670 and previous config saved to /var/cache/conftool/dbconfig/20230309-105259-marostegui.json [10:53:27] (03PS1) 10Nicolas Fraison: hadoop::hdfs: remove nrpe check file age on FSImage [puppet] - 10https://gerrit.wikimedia.org/r/896057 
(https://phabricator.wikimedia.org/T331310) [10:53:29] (03PS1) 10Nicolas Fraison: hadoop:hdfs: fully remove FSImage nrpe check file age alert [puppet] - 10https://gerrit.wikimedia.org/r/896058 (https://phabricator.wikimedia.org/T331310) [10:53:49] (03CR) 10CI reject: [V: 04-1] hadoop::hdfs: remove nrpe check file age on FSImage [puppet] - 10https://gerrit.wikimedia.org/r/896057 (https://phabricator.wikimedia.org/T331310) (owner: 10Nicolas Fraison) [10:54:17] (03CR) 10Btullis: "I believe that we need to update the changelog as well, otherwise the build process will not know that this version needs to be updated." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/896053 (https://phabricator.wikimedia.org/T318926) (owner: 10Nicolas Fraison) [10:55:28] (KubernetesAPILatency) resolved: (5) High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:57:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P45671 and previous config saved to /var/cache/conftool/dbconfig/20230309-105726-marostegui.json [10:57:30] (Primary outbound port utilisation over 80% #page) firing: Alert for device cr2-codfw.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [10:57:32] (03PS3) 10Nicolas Fraison: spark-operator: rely on exec entrypoint instead of shell one [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/896053 (https://phabricator.wikimedia.org/T318926) [10:57:37] (03CR) 10Nicolas
Fraison: spark-operator: rely on exec entrypoint instead of shell one (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/896053 (https://phabricator.wikimedia.org/T318926) (owner: 10Nicolas Fraison) [10:58:06] woot [10:58:35] topranks: you around? [10:58:44] !incidents [10:58:45] 3467 (UNACKED) Primary outbound port utilisation over 80% (paged) global (cr2-codfw.wikimedia.org) [10:58:45] 3466 (RESOLVED) SessionStoreErrorRateHigh (eqiad) [10:58:55] !ack 3467 [10:58:55] 3467 (ACKED) Primary outbound port utilisation over 80% (paged) global (cr2-codfw.wikimedia.org) [10:59:06] (03CR) 10Btullis: "See here for the build process docs:" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/896053 (https://phabricator.wikimedia.org/T318926) (owner: 10Nicolas Fraison) [10:59:07] marostegui: I am yes [10:59:11] * topranks looking [10:59:12] topranks: is that related to your maintenance? [10:59:36] likely related to my maintenance, which I've just finished, it's a high utilization alert [10:59:39] checking it [11:00:04] ok let me know if you want me to resolve it [11:00:05] mvolz: How many deployers does it take to do Services – Citoid / Zotero deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230309T1100). 
[11:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230309T1100) [11:00:25] !log otto@deploy2002 Synchronized wmf-config/InitialiseSettings.php: Step 2b: InitialiseSettings.php - remove duplicate configs - T308932 (duration: 06m 37s) [11:00:30] T308932: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022) - https://phabricator.wikimedia.org/T308932 [11:00:37] (03PS2) 10Nicolas Fraison: hadoop::hdfs: remove nrpe check file age on FSImage [puppet] - 10https://gerrit.wikimedia.org/r/896057 (https://phabricator.wikimedia.org/T331310) [11:00:38] it's odd host should be downtimed [11:00:39] (03PS2) 10Nicolas Fraison: hadoop:hdfs: fully remove FSImage nrpe check file age alert [puppet] - 10https://gerrit.wikimedia.org/r/896058 (https://phabricator.wikimedia.org/T331310) [11:00:59] (03CR) 10CI reject: [V: 04-1] hadoop::hdfs: remove nrpe check file age on FSImage [puppet] - 10https://gerrit.wikimedia.org/r/896057 (https://phabricator.wikimedia.org/T331310) (owner: 10Nicolas Fraison) [11:01:39] !log cmooney@cumin1001 START - Cookbook sre.hosts.remove-downtime for 9 hosts [11:01:42] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 9 hosts [11:02:17] thanks topranks [11:02:28] marostegui: I resolved, not sure why the downtime didn't block it but wasn't an issue either way [11:02:30] (Primary outbound port utilisation over 80% #page) resolved: Device cr2-codfw.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [11:02:39] just expected high
use on the remaining links between the two CRs when one was down [11:02:41] both back up now [11:02:47] thanks :) [11:02:50] ack, thx [11:02:54] apologies for the noise [11:03:03] np [11:05:09] (03PS2) 10Btullis: Update the spark-operator chart with consistent image versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/895842 (https://phabricator.wikimedia.org/T318926) [11:07:06] (03CR) 10Btullis: Update the spark-operator chart with consistent image versions (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/895842 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [11:08:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T329260)', diff saved to https://phabricator.wikimedia.org/P45672 and previous config saved to /var/cache/conftool/dbconfig/20230309-110806-marostegui.json [11:08:08] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2137.codfw.wmnet with reason: Maintenance [11:08:11] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [11:08:21] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2137.codfw.wmnet with reason: Maintenance [11:08:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2137:3314 (T329260)', diff saved to https://phabricator.wikimedia.org/P45673 and previous config saved to /var/cache/conftool/dbconfig/20230309-110827-marostegui.json [11:08:33] 10SRE, 10Patch-For-Review: Prepare our base system layer for Debian 11/bullseye - https://phabricator.wikimedia.org/T275873 (10MoritzMuehlenhoff) [11:08:35] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: Prepare puppet master infrastructure for bullseye - https://phabricator.wikimedia.org/T285086 (10MoritzMuehlenhoff) 05Open→03Declined This task got replaced/superceded by https://phabricator.wikimedia.org/T330490 [11:08:49] (03PS3) 10Btullis: Update the 
spark-operator chart with consistent image versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/895842 (https://phabricator.wikimedia.org/T318926) [11:09:48] (03CR) 10Btullis: [C: 03+1] "Adding jayme and otto as reviewers for good measure." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/896053 (https://phabricator.wikimedia.org/T318926) (owner: 10Nicolas Fraison) [11:10:54] (03CR) 10Btullis: "Bumped version again as a result of this change: https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/896053" [deployment-charts] - 10https://gerrit.wikimedia.org/r/895842 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [11:12:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P45674 and previous config saved to /var/cache/conftool/dbconfig/20230309-111233-marostegui.json [11:14:05] 10SRE, 10Machine-Learning-Team, 10serviceops-radar, 10artificial-intelligence: New Service Request 'open_nsfw' - https://phabricator.wikimedia.org/T250110 (10akosiaris) I am tentatively removing #service-deployment-requests as I don't see how #serviceops (the onwer of that tag) has anything to do with this... [11:14:24] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10cmooney) [11:14:36] 10SRE, 10serviceops: mw2420-mw2451 service implementation tracking - https://phabricator.wikimedia.org/T326363 (10Clement_Goubert) That looks a lot better balanced even without touching row A skew, we wouldn't dip below 50% capacity in any cluster if we lose row A (which was the concern for jobrunners). We're... 
[11:16:34] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:16:52] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:17:10] (03PS2) 10Arturo Borrero Gonzalez: cloudgw: don't accept NEW connections wan -> virt to internal private addresses [puppet] - 10https://gerrit.wikimedia.org/r/896052 (https://phabricator.wikimedia.org/T272585) [11:18:49] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40049/console" [puppet] - 10https://gerrit.wikimedia.org/r/894646 (owner: 10Jbond) [11:18:53] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "DONT MERGE. This needs live testing before merging." [puppet] - 10https://gerrit.wikimedia.org/r/896052 (https://phabricator.wikimedia.org/T272585) (owner: 10Arturo Borrero Gonzalez) [11:20:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314 (T329260)', diff saved to https://phabricator.wikimedia.org/P45675 and previous config saved to /var/cache/conftool/dbconfig/20230309-112019-marostegui.json [11:20:24] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [11:23:01] (03CR) 10Jbond: [V: 03+1] P:rsyslog: manage /etc/logrotate.d/rsyslog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/894646 (owner: 10Jbond) [11:24:48] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:25:06] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10cmooney) >>! 
In T327919#8664016, @aborrero wrote: > Please let me know if there is something I can do t... [11:26:04] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.207 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:26:22] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49709 bytes in 0.157 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:27:39] 10SRE, 10Machine-Learning-Team, 10serviceops, 10Language-Team (Language-2023-January-March), 10Service-deployment-requests: New Service Deployment Request: NNLB-200 for machine translation - https://phabricator.wikimedia.org/T329971 (10akosiaris) I 've transformed (roughly) this to a #service-deployment-... [11:27:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T329203)', diff saved to https://phabricator.wikimedia.org/P45676 and previous config saved to /var/cache/conftool/dbconfig/20230309-112739-marostegui.json [11:27:41] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2156.codfw.wmnet with reason: Maintenance [11:27:45] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [11:27:54] RECOVERY - Check unit status of httpbb_kubernetes_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:27:55] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2156.codfw.wmnet with reason: Maintenance [11:27:56] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [11:27:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [11:28:05] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2156 (T329203)', diff saved to https://phabricator.wikimedia.org/P45677 and previous config saved to /var/cache/conftool/dbconfig/20230309-112804-marostegui.json [11:28:32] 10SRE, 10Infrastructure-Foundations: Migrate apt repository to bullseye or bookworm - https://phabricator.wikimedia.org/T331613 (10MoritzMuehlenhoff) [11:28:58] 10SRE, 10Infrastructure-Foundations: Migrate apt repository to bullseye or bookworm - https://phabricator.wikimedia.org/T331613 (10MoritzMuehlenhoff) [11:30:58] (03PS1) 10Muehlenhoff: buster updates [puppet] - 10https://gerrit.wikimedia.org/r/896060 [11:33:04] (03CR) 10Muehlenhoff: [C: 03+2] buster updates [puppet] - 10https://gerrit.wikimedia.org/r/896060 (owner: 10Muehlenhoff) [11:33:52] (03PS5) 10Jbond: P:rsyslog: manage /etc/logrotate.d/rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/894646 [11:35:09] (03CR) 10CI reject: [V: 04-1] P:rsyslog: manage /etc/logrotate.d/rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/894646 (owner: 10Jbond) [11:35:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314', diff saved to https://phabricator.wikimedia.org/P45678 and previous config saved to /var/cache/conftool/dbconfig/20230309-113525-marostegui.json [11:37:39] (03PS6) 10Jbond: P:rsyslog: manage /etc/logrotate.d/rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/894646 [11:37:41] (03PS7) 10Jbond: pki: Add blackbox tests for pki services [puppet] - 10https://gerrit.wikimedia.org/r/895757 [11:38:00] (03CR) 10Jbond: [C: 03+2] P:blackbox_exporter: update client auth checks to use local certs [puppet] - 10https://gerrit.wikimedia.org/r/895821 (owner: 10Jbond) [11:38:12] (03CR) 10Jbond: [C: 03+2] pki: Add blackbox tests for pki services [puppet] - 10https://gerrit.wikimedia.org/r/895757 (owner: 10Jbond) [11:39:29] (03CR) 10CI reject: [V: 04-1] P:rsyslog: manage /etc/logrotate.d/rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/894646 (owner: 
10Jbond) [11:39:39] (03PS3) 10Marostegui: change_cuc_actor_T329684.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/895906 (https://phabricator.wikimedia.org/T329684) [11:40:22] !log installing git security updates [11:40:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:28] (03CR) 10Marostegui: change_cuc_actor_T329684.py: New schema change (031 comment) [software/schema-changes] - 10https://gerrit.wikimedia.org/r/895906 (https://phabricator.wikimedia.org/T329684) (owner: 10Marostegui) [11:41:36] (03CR) 10Marostegui: [C: 03+2] change_cuc_actor_T329684.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/895906 (https://phabricator.wikimedia.org/T329684) (owner: 10Marostegui) [11:41:58] (03Merged) 10jenkins-bot: change_cuc_actor_T329684.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/895906 (https://phabricator.wikimedia.org/T329684) (owner: 10Marostegui) [11:42:25] (03CR) 10Jbond: "LGTM but still needs manage approval" [puppet] - 10https://gerrit.wikimedia.org/r/896046 (https://phabricator.wikimedia.org/T331500) (owner: 10MVernon) [11:42:56] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for David Martin - https://phabricator.wikimedia.org/T331500 (10jbond) [11:43:30] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance [11:43:32] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance [11:43:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2180 (T329684)', diff saved to https://phabricator.wikimedia.org/P45679 and previous config saved to /var/cache/conftool/dbconfig/20230309-114338-marostegui.json [11:43:43] T329684: Drop default value from cuc_actor and cuc_comment_id on wmf wikis - 
https://phabricator.wikimedia.org/T329684 [11:44:18] (03PS7) 10Jbond: P:rsyslog: manage /etc/logrotate.d/rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/894646 [11:44:41] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2097.codfw.wmnet with reason: Maintenance [11:44:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2097.codfw.wmnet with reason: Maintenance [11:45:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P45680 and previous config saved to /var/cache/conftool/dbconfig/20230309-114500-root.json [11:45:36] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40052/console" [puppet] - 10https://gerrit.wikimedia.org/r/894646 (owner: 10Jbond) [11:46:45] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10aborrero) >>! In T327919#8679314, @cmooney wrote: >>>! In T327919#8664016, @aborrero wrote: >> Please l... 
[11:47:43] !log Deploy schema change on s1 codfw dbmaint T329684 [11:47:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314', diff saved to https://phabricator.wikimedia.org/P45681 and previous config saved to /var/cache/conftool/dbconfig/20230309-115031-marostegui.json [11:51:10] (ProbeDown) firing: (13) Service pki1001:443 has failed probes (http_aux_front_proxy_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#pki1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:52:14] (03PS3) 10Btullis: Add an entry in the service catalog for the aqs service running in codfw [puppet] - 10https://gerrit.wikimedia.org/r/894017 (https://phabricator.wikimedia.org/T331115) [11:54:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T329203)', diff saved to https://phabricator.wikimedia.org/P45682 and previous config saved to /var/cache/conftool/dbconfig/20230309-115445-marostegui.json [11:54:49] (ProbeDown) firing: (34) Service pki1001:443 has failed probes (http_aux_front_proxy_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#pki1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:54:51] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [11:56:46] (03PS3) 10Btullis: Add forward and reverse entries for aqs.svc.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/894024 (https://phabricator.wikimedia.org/T331115) [11:58:00] (03CR) 10Btullis: [C: 03+2] Add forward and reverse entries for aqs.svc.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/894024 
(https://phabricator.wikimedia.org/T331115) (owner: 10Btullis) [12:00:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P45683 and previous config saved to /var/cache/conftool/dbconfig/20230309-120005-root.json [12:01:41] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for David Martin - https://phabricator.wikimedia.org/T331500 (10jbond) [12:01:46] (03CR) 10Jbond: [C: 03+1] admin: dmartin now needs analytics-privatedata + krb [puppet] - 10https://gerrit.wikimedia.org/r/896046 (https://phabricator.wikimedia.org/T331500) (owner: 10MVernon) [12:03:40] (03CR) 10MVernon: [C: 03+2] admin: dmartin now needs analytics-privatedata + krb [puppet] - 10https://gerrit.wikimedia.org/r/896046 (https://phabricator.wikimedia.org/T331500) (owner: 10MVernon) [12:04:49] (03CR) 10Clément Goubert: [V: 03+1] "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/896063 (https://phabricator.wikimedia.org/T326363) (owner: 10Clément Goubert) [12:05:00] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40056/console" [puppet] - 10https://gerrit.wikimedia.org/r/894017 (https://phabricator.wikimedia.org/T331115) (owner: 10Btullis) [12:05:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314 (T329260)', diff saved to https://phabricator.wikimedia.org/P45684 and previous config saved to /var/cache/conftool/dbconfig/20230309-120537-marostegui.json [12:05:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2138.codfw.wmnet with reason: Maintenance [12:05:43] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [12:05:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2138.codfw.wmnet with 
reason: Maintenance [12:05:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2138:3314 (T329260)', diff saved to https://phabricator.wikimedia.org/P45685 and previous config saved to /var/cache/conftool/dbconfig/20230309-120559-marostegui.json [12:06:22] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for David Martin - https://phabricator.wikimedia.org/T331500 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon @DMartin-WMF all done. [12:06:45] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40057/console" [puppet] - 10https://gerrit.wikimedia.org/r/894017 (https://phabricator.wikimedia.org/T331115) (owner: 10Btullis) [12:08:05] (03PS2) 10Clément Goubert: Assign mediawiki roles to mw2420-mw2451 [puppet] - 10https://gerrit.wikimedia.org/r/896063 (https://phabricator.wikimedia.org/T326363) [12:08:19] 10SRE, 10serviceops, 10Abstract Wikipedia team (Phase λ – Launch), 10Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (10MatthewVernon) [12:09:08] (03PS1) 10Jbond: promethus: move expose ssl certs to prometheus::ops [puppet] - 10https://gerrit.wikimedia.org/r/896065 [12:09:35] (03CR) 10Jbond: [C: 03+2] promethus: move expose ssl certs to prometheus::ops [puppet] - 10https://gerrit.wikimedia.org/r/896065 (owner: 10Jbond) [12:09:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P45686 and previous config saved to /var/cache/conftool/dbconfig/20230309-120951-marostegui.json [12:13:45] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] cr-cloud: permit toolsdb return traffic to cloudcontrols (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/896051 (https://phabricator.wikimedia.org/T303663) (owner: 10Majavah) 
[12:13:47] (03CR) 10Vgutierrez: [V: 03+1] "pybal and the alerting system doesn't support a cluster without any administratively pooled server AFAIK so it won't be happy cause aqs@co" [puppet] - 10https://gerrit.wikimedia.org/r/894017 (https://phabricator.wikimedia.org/T331115) (owner: 10Btullis) [12:15:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P45687 and previous config saved to /var/cache/conftool/dbconfig/20230309-121510-root.json [12:17:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314 (T329260)', diff saved to https://phabricator.wikimedia.org/P45688 and previous config saved to /var/cache/conftool/dbconfig/20230309-121756-marostegui.json [12:18:02] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [12:19:49] (ProbeDown) firing: (68) Service pki1001:443 has failed probes (http_aux_front_proxy_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:20:27] (03CR) 10Alexandros Kosiaris: "Thanks." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/895748 (owner: 10Alexandros Kosiaris) [12:20:39] (03PS2) 10Alexandros Kosiaris: istio wikikube: Add the proper tolerations [deployment-charts] - 10https://gerrit.wikimedia.org/r/895748 [12:21:29] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10MoritzMuehlenhoff) [12:22:58] !log rebalancing ganeti eqiad/C after completion of bullseye updates T311687 [12:23:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:03] T311687: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 [12:24:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P45689 and previous config saved to /var/cache/conftool/dbconfig/20230309-122458-marostegui.json [12:27:35] (03CR) 10Alexandros Kosiaris: [C: 03+2] istio wikikube: Add the proper tolerations [deployment-charts] - 10https://gerrit.wikimedia.org/r/895748 (owner: 10Alexandros Kosiaris) [12:27:49] (03CR) 10Alexandros Kosiaris: [C: 03+2] istio wikikube: Add the proper tolerations (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/895748 (owner: 10Alexandros Kosiaris) [12:30:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P45690 and previous config saved to /var/cache/conftool/dbconfig/20230309-123015-root.json [12:32:59] (03Merged) 10jenkins-bot: istio wikikube: Add the proper tolerations [deployment-charts] - 10https://gerrit.wikimedia.org/r/895748 (owner: 10Alexandros Kosiaris) [12:33:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314', diff saved to https://phabricator.wikimedia.org/P45691 and previous config saved to /var/cache/conftool/dbconfig/20230309-123303-marostegui.json [12:40:04] !log marostegui@cumin1001 
dbctl commit (dc=all): 'Repooling after maintenance db2156 (T329203)', diff saved to https://phabricator.wikimedia.org/P45692 and previous config saved to /var/cache/conftool/dbconfig/20230309-124004-marostegui.json [12:40:06] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2177.codfw.wmnet with reason: Maintenance [12:40:11] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [12:40:19] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2177.codfw.wmnet with reason: Maintenance [12:40:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2177 (T329203)', diff saved to https://phabricator.wikimedia.org/P45693 and previous config saved to /var/cache/conftool/dbconfig/20230309-124025-marostegui.json [12:42:41] (03PS1) 10Jbond: blackbox: allow sending raw bodies: [puppet] - 10https://gerrit.wikimedia.org/r/896082 [12:42:43] (03PS1) 10Jbond: pki: use body_raw for check [puppet] - 10https://gerrit.wikimedia.org/r/896083 [12:43:50] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40058/console" [puppet] - 10https://gerrit.wikimedia.org/r/896083 (owner: 10Jbond) [12:44:40] (03CR) 10CI reject: [V: 04-1] pki: use body_raw for check [puppet] - 10https://gerrit.wikimedia.org/r/896083 (owner: 10Jbond) [12:46:20] !log btullis@cumin1001 START - Cookbook sre.dns.netbox [12:47:04] (03CR) 10Btullis: [C: 03+2] Add forward and reverse entries for aqs.svc.codfw.wmnet (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/894024 (https://phabricator.wikimedia.org/T331115) (owner: 10Btullis) [12:47:06] (03PS2) 10Jbond: blackbox: allow sending raw bodies: [puppet] - 10https://gerrit.wikimedia.org/r/896082 [12:48:01] (03PS1) 10MarcoAurelio: eswikiversity: Enable SFS in enforce mode [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/896085 (https://phabricator.wikimedia.org/T331182) [12:48:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314', diff saved to https://phabricator.wikimedia.org/P45694 and previous config saved to /var/cache/conftool/dbconfig/20230309-124809-marostegui.json [12:49:19] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/895844 (owner: 10Slyngshede) [12:49:39] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40060/console" [puppet] - 10https://gerrit.wikimedia.org/r/896082 (owner: 10Jbond) [12:49:50] (03CR) 10Slyngshede: [C: 03+2] C:idm::deployment ldap servers must be a list. [puppet] - 10https://gerrit.wikimedia.org/r/895844 (owner: 10Slyngshede) [12:50:06] (03PS3) 10Nicolas Fraison: hadoop::hdfs: remove nrpe check file age on FSImage [puppet] - 10https://gerrit.wikimedia.org/r/896057 (https://phabricator.wikimedia.org/T331310) [12:50:08] (03PS3) 10Nicolas Fraison: hadoop:hdfs: fully remove FSImage nrpe check file age alert [puppet] - 10https://gerrit.wikimedia.org/r/896058 (https://phabricator.wikimedia.org/T331310) [12:51:12] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40061/console" [puppet] - 10https://gerrit.wikimedia.org/r/896083 (owner: 10Jbond) [12:53:21] !log btullis@puppetmaster1001 conftool action : set/weight=10; selector: name=aqs2001.codfw.wmnet [12:55:20] !log btullis@puppetmaster1001 conftool action : set/weight=10; selector: cluster=aqs,dc=codfw [12:55:26] (03PS3) 10Jbond: blackbox: allow sending raw bodies: [puppet] - 10https://gerrit.wikimedia.org/r/896082 [12:55:31] (03PS2) 10Jbond: pki: use body_raw for check [puppet] - 10https://gerrit.wikimedia.org/r/896083 [12:56:28] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40062/console" [puppet] - 10https://gerrit.wikimedia.org/r/896082 (owner: 10Jbond) [12:57:17] !log btullis@puppetmaster1001 conftool action : set/pooled=yes; selector: cluster=aqs,dc=codfw [12:58:02] (03CR) 10Jelto: "left some feedback in-line" [puppet] - 10https://gerrit.wikimedia.org/r/895240 (https://phabricator.wikimedia.org/T322369) (owner: 10EoghanGaffney) [12:58:17] (03CR) 10Nicolas Fraison: Update the spark-operator chart with consistent image versions (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/895842 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [12:59:34] (03CR) 10Btullis: Add an entry in the service catalog for the aqs service running in codfw (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/894017 (https://phabricator.wikimedia.org/T331115) (owner: 10Btullis) [12:59:58] (03CR) 10JMeybohm: [C: 03+1] spark-operator: rely on exec entrypoint instead of shell one [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/896053 (https://phabricator.wikimedia.org/T318926) (owner: 10Nicolas Fraison) [13:00:09] (03CR) 10Btullis: [C: 03+2] Add an entry in the service catalog for the aqs service running in codfw [puppet] - 10https://gerrit.wikimedia.org/r/894017 (https://phabricator.wikimedia.org/T331115) (owner: 10Btullis) [13:01:55] (03CR) 10Jbond: [V: 03+1 C: 03+2] blackbox: allow sending raw bodies: [puppet] - 10https://gerrit.wikimedia.org/r/896082 (owner: 10Jbond) [13:01:59] (03CR) 10Jbond: [C: 03+2] pki: use body_raw for check [puppet] - 10https://gerrit.wikimedia.org/r/896083 (owner: 10Jbond) [13:03:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314 (T329260)', diff saved to https://phabricator.wikimedia.org/P45695 and previous config saved to /var/cache/conftool/dbconfig/20230309-130315-marostegui.json [13:03:17] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime 
for 12:00:00 on db2139.codfw.wmnet with reason: Maintenance [13:03:18] !log btullis@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: btullis-T331115 - btullis@cumin1001" [13:03:19] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2139.codfw.wmnet with reason: Maintenance [13:03:21] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [13:03:29] T331115: Finalize the multi-dc configuration of AQS (nodejs) in codfw - https://phabricator.wikimedia.org/T331115 [13:04:26] !log btullis@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: btullis-T331115 - btullis@cumin1001" [13:04:26] !log btullis@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:09:49] (ProbeDown) firing: (68) Service pki1001:443 has failed probes (http_aux_front_proxy_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:11:17] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2147.codfw.wmnet with reason: Maintenance [13:11:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2147.codfw.wmnet with reason: Maintenance [13:11:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2147 (T329260)', diff saved to https://phabricator.wikimedia.org/P45696 and previous config saved to /var/cache/conftool/dbconfig/20230309-131136-marostegui.json [13:11:42] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [13:12:24] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.12:7232]) https://wikitech.wikimedia.org/wiki/PyBal 
[13:13:28] PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.12:7232]) https://wikitech.wikimedia.org/wiki/PyBal [13:14:00] btullis: fyi ^^^ [13:14:16] i think this relates to what vgutier.rez mentioned [13:14:34] PROBLEM - PyBal connections to etcd on lvs2009 is CRITICAL: CRITICAL: 71 connections established with conf2005.codfw.wmnet:4001 (min=72) https://wikitech.wikimedia.org/wiki/PyBal [13:14:35] jbond: Thanks. Looking now. [13:14:40] PROBLEM - PyBal connections to etcd on lvs2010 is CRITICAL: CRITICAL: 89 connections established with conf2004.codfw.wmnet:4001 (min=90) https://wikitech.wikimedia.org/wiki/PyBal [13:16:05] that's expected [13:16:15] * vgutierrez taking care of it [13:16:16] Phew! [13:16:18] (ProbeDown) firing: Service aqs:7232 has failed probes (http_aqs_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#aqs:7232 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:16:45] !incidents [13:16:45] 3468 (ACKED) ProbeDown (10.2.1.12 ip4 aqs:7232 probes/service http_aqs_ip4 codfw) [13:16:45] 3467 (RESOLVED) Primary outbound port utilisation over 80% (paged) global (cr2-codfw.wikimedia.org) [13:16:45] 3466 (RESOLVED) SessionStoreErrorRateHigh (eqiad) [13:17:20] !log rolling restart of pybal in lvs2009 and lvs2010 [13:17:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:45] (JobUnavailable) firing: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:19:20] RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [13:19:27] (03PS1) 10Kosta Harlan: 
changeprop: Add rules for notificationKeepGoingJob [deployment-charts] - 10https://gerrit.wikimedia.org/r/896091 (https://phabricator.wikimedia.org/T331616) [13:19:49] (ProbeDown) firing: (69) Service pki1001:443 has failed probes (http_aux_front_proxy_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:19:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T329203)', diff saved to https://phabricator.wikimedia.org/P45697 and previous config saved to /var/cache/conftool/dbconfig/20230309-131951-marostegui.json [13:19:57] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [13:20:30] (03CR) 10Kosta Harlan: changeprop: Add rules for notificationKeepGoingJob (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/896091 (https://phabricator.wikimedia.org/T331616) (owner: 10Kosta Harlan) [13:20:32] RECOVERY - PyBal connections to etcd on lvs2010 is OK: OK: 90 connections established with conf2004.codfw.wmnet:4001 (min=90) https://wikitech.wikimedia.org/wiki/PyBal [13:21:18] (ProbeDown) resolved: Service aqs:7232 has failed probes (http_aqs_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#aqs:7232 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:22:22] (03CR) 10Nicolas Fraison: [V: 03+2 C: 03+2] spark-operator: rely on exec entrypoint instead of shell one [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/896053 (https://phabricator.wikimedia.org/T318926) (owner: 10Nicolas Fraison) [13:22:45] (JobUnavailable) resolved: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:23:31] 
!log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T329260)', diff saved to https://phabricator.wikimedia.org/P45698 and previous config saved to /var/cache/conftool/dbconfig/20230309-132331-marostegui.json [13:23:37] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [13:24:04] RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [13:24:27] (03PS2) 10Kosta Harlan: changeprop: Rules for notificationKeepGoingJob and notificationGetStartedJob [deployment-charts] - 10https://gerrit.wikimedia.org/r/896091 (https://phabricator.wikimedia.org/T331616) [13:24:49] (ProbeDown) resolved: (69) Service pki1001:443 has failed probes (http_aux_front_proxy_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:24:50] (03PS3) 10Kosta Harlan: changeprop: Rules for notificationKeepGoingJob and notificationGetStartedJob [deployment-charts] - 10https://gerrit.wikimedia.org/r/896091 (https://phabricator.wikimedia.org/T331616) [13:26:14] RECOVERY - PyBal connections to etcd on lvs2009 is OK: OK: 72 connections established with conf2005.codfw.wmnet:4001 (min=72) https://wikitech.wikimedia.org/wiki/PyBal [13:27:07] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db[2135,2160].codfw.wmnet,db[1117,1176,1183].eqiad.wmnet with reason: Topology changes [13:27:22] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2135,2160].codfw.wmnet,db[1117,1176,1183].eqiad.wmnet with reason: Topology changes [13:27:41] (03CR) 10Muehlenhoff: [C: 03+1] "Very nice! 
There are actually two places in our Puppet which use an unhashed lookup:" [puppet] - 10https://gerrit.wikimedia.org/r/895811 (owner: 10Jbond) [13:27:58] 10SRE, 10DBA, 10Toolhub, 10Wikimedia-Mailing-lists, and 2 others: Switchover m5 master (db1183 -> db1176) - https://phabricator.wikimedia.org/T330847 (10Marostegui) [13:28:08] (03CR) 10Muehlenhoff: [C: 03+1] "PCC is also fine https://puppet-compiler.wmflabs.org/output/895811/40063/" [puppet] - 10https://gerrit.wikimedia.org/r/895811 (owner: 10Jbond) [13:31:36] RECOVERY - Check systemd state on arclamp2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:31:50] RECOVERY - Check systemd state on arclamp1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:34:30] !log installing curl security updates [13:34:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P45699 and previous config saved to /var/cache/conftool/dbconfig/20230309-133458-marostegui.json [13:38:35] (03PS4) 10Btullis: Update the spark-operator chart with consistent image versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/895842 (https://phabricator.wikimedia.org/T318926) [13:38:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P45700 and previous config saved to /var/cache/conftool/dbconfig/20230309-133837-marostegui.json [13:42:01] !log restarting FPM/Apache on mw canaries to pick up curl updates [13:42:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:52] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /robots.txt (Untitled test) is CRITICAL: Test Untitled test returned the 
unexpected status 503 (expecting: 200): / (spec from root) is CRITICAL: Test spec from root returned the unexpected status 503 (expecting: 200): /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wi [13:43:52] d [13:45:44] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [13:46:47] (03PS1) 10Btullis: Upgrade Airflon on an-launcher1002 to version 2.5.1 [puppet] - 10https://gerrit.wikimedia.org/r/896098 (https://phabricator.wikimedia.org/T326193) [13:47:49] (03PS2) 10Btullis: Upgrade Airflow on an-launcher1002 to version 2.5.1 [puppet] - 10https://gerrit.wikimedia.org/r/896098 (https://phabricator.wikimedia.org/T326193) [13:49:35] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40064/console" [puppet] - 10https://gerrit.wikimedia.org/r/896098 (https://phabricator.wikimedia.org/T326193) (owner: 10Btullis) [13:49:41] (03PS1) 10Muehlenhoff: Add urldownloader100[34] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/896099 (https://phabricator.wikimedia.org/T329945) [13:49:48] (ProbeDown) firing: Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:50:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P45701 and previous config saved to /var/cache/conftool/dbconfig/20230309-135004-marostegui.json [13:51:10] (ProbeDown) resolved: Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:51:53] (03CR) 10Muehlenhoff: [C: 03+2] Add urldownloader100[34] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/896099 (https://phabricator.wikimedia.org/T329945) (owner: 10Muehlenhoff) [13:52:32] (03PS3) 10Btullis: Upgrade Airflow on an-launcher1002 to version 2.5.1 [puppet] - 10https://gerrit.wikimedia.org/r/896098 (https://phabricator.wikimedia.org/T326193) [13:53:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P45702 and previous config saved to /var/cache/conftool/dbconfig/20230309-135343-marostegui.json [13:53:51] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40065/console" [puppet] - 10https://gerrit.wikimedia.org/r/896098 (https://phabricator.wikimedia.org/T326193) (owner: 10Btullis) [13:54:42] (03CR) 10Stevemunene: [C: 03+1] Upgrade Airflow on an-launcher1002 to version 2.5.1 [puppet] - 10https://gerrit.wikimedia.org/r/896098 (https://phabricator.wikimedia.org/T326193) (owner: 10Btullis) [13:54:58] (03CR) 10Btullis: [V: 03+1 C: 03+2] Upgrade Airflow on an-launcher1002 to version 2.5.1 [puppet] - 10https://gerrit.wikimedia.org/r/896098 (https://phabricator.wikimedia.org/T326193) (owner: 10Btullis) [13:57:41] (03PS1) 10FNegri: [toolsdb] Update config file but keep old one [puppet] - 10https://gerrit.wikimedia.org/r/896101 [13:58:01] (03PS2) 10FNegri: [toolsdb] Update config file but keep old one [puppet] - 10https://gerrit.wikimedia.org/r/896101 (https://phabricator.wikimedia.org/T329970) [13:58:03] (03CR) 10CI reject: [V: 04-1] [toolsdb] Update config file but keep old one [puppet] - 10https://gerrit.wikimedia.org/r/896101 (https://phabricator.wikimedia.org/T329970) (owner: 10FNegri) [13:58:22] (03CR) 10CI reject: [V: 04-1] 
[toolsdb] Update config file but keep old one [puppet] - 10https://gerrit.wikimedia.org/r/896101 (https://phabricator.wikimedia.org/T329970) (owner: 10FNegri) [14:00:04] !log aqu@deploy2002 Started deploy [airflow-dags/analytics@9fba86b]: Upgrade to 2.5.1 from origin/T326194_airflow_deb_creation_with_gitlab_ci [airflow-dags@9fba86b] [14:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230309T1400) [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230309T1400) [14:00:05] duesen: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:16] (03PS3) 10FNegri: [toolsdb] Update config file but keep old one [puppet] - 10https://gerrit.wikimedia.org/r/896101 (https://phabricator.wikimedia.org/T329970) [14:00:17] !log aqu@deploy2002 Finished deploy [airflow-dags/analytics@9fba86b]: Upgrade to 2.5.1 from origin/T326194_airflow_deb_creation_with_gitlab_ci [airflow-dags@9fba86b] (duration: 00m 13s) [14:00:28] I can deploy [14:00:29] I’m in a meeting, sorry [14:00:31] yay [14:00:39] (03CR) 10CI reject: [V: 04-1] [toolsdb] Update config file but keep old one [puppet] - 10https://gerrit.wikimedia.org/r/896101 (https://phabricator.wikimedia.org/T329970) (owner: 10FNegri) [14:01:21] duesen: around? :) [14:01:39] (03PS3) 10Samtar: Bump parsoid parser cache writes to 50%. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/886905 (https://phabricator.wikimedia.org/T320534) (owner: 10Daniel Kinzler) [14:02:17] (03PS4) 10FNegri: [toolsdb] Update config file but keep old one [puppet] - 10https://gerrit.wikimedia.org/r/896101 (https://phabricator.wikimedia.org/T329970) [14:02:38] (03CR) 10CI reject: [V: 04-1] [toolsdb] Update config file but keep old one [puppet] - 10https://gerrit.wikimedia.org/r/896101 (https://phabricator.wikimedia.org/T329970) (owner: 10FNegri) [14:03:48] * TheresNoTime will await duesen [14:04:19] TheresNoTime: hey! [14:04:25] o/ [14:04:42] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886905 (https://phabricator.wikimedia.org/T320534) (owner: 10Daniel Kinzler) [14:04:46] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for Sergio Gimeno - https://phabricator.wikimedia.org/T330070 (10thcipriani) >>! In T330070#8667684, @MatthewVernon wrote: > @thcipriani can I ping you about this approval, please? Yes, sorry for the delay :( — approved! [14:05:04] TheresNoTime: so... this is like the last couple of times. It just bumps a config variable, and the effect will become visible on grafana once it is hit by full traffic. Nothing to test. 
[14:05:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T329203)', diff saved to https://phabricator.wikimedia.org/P45703 and previous config saved to /var/cache/conftool/dbconfig/20230309-140510-marostegui.json [14:05:16] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [14:05:17] (03PS5) 10FNegri: [toolsdb] Update config file but keep old one [puppet] - 10https://gerrit.wikimedia.org/r/896101 (https://phabricator.wikimedia.org/T329970) [14:05:20] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for Sergio Gimeno - https://phabricator.wikimedia.org/T330070 (10SLyngshede-WMF) [14:05:23] duesen: ack, okay thank you, will just run it through :) [14:05:25] (03PS4) 10Slyngshede: data.yaml add sgimeno to deployment group. [puppet] - 10https://gerrit.wikimedia.org/r/890797 (https://phabricator.wikimedia.org/T330070) [14:05:27] (03Merged) 10jenkins-bot: Bump parsoid parser cache writes to 50%. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886905 (https://phabricator.wikimedia.org/T320534) (owner: 10Daniel Kinzler) [14:05:41] (03CR) 10CI reject: [V: 04-1] [toolsdb] Update config file but keep old one [puppet] - 10https://gerrit.wikimedia.org/r/896101 (https://phabricator.wikimedia.org/T329970) (owner: 10FNegri) [14:05:51] !log samtar@deploy2002 Started scap: Backport for [[gerrit:886905|Bump parsoid parser cache writes to 50%. (T320534)]] [14:05:55] TheresNoTime: i'll keep an eye on the dashboard [14:05:56] T320534: Put Parsoid output into the ParserCache on every edit - https://phabricator.wikimedia.org/T320534 [14:06:30] 10SRE-swift-storage, 10ConfirmEdit (CAPTCHA extension), 10Wikimedia-production-error: FileBackendError: Iterator page I/O error. 
- https://phabricator.wikimedia.org/T318941 (10TheresNoTime) Seeing a slight uptick (again) with these, recent: ==== Error ==== * mwversion: 1.40.0-wmf.25 * reqId: 65b5c08f-f0ab-... [14:06:38] (03CR) 10Slyngshede: [C: 03+2] data.yaml add sgimeno to deployment group. [puppet] - 10https://gerrit.wikimedia.org/r/890797 (https://phabricator.wikimedia.org/T330070) (owner: 10Slyngshede) [14:07:33] !log samtar@deploy2002 daniel and samtar: Backport for [[gerrit:886905|Bump parsoid parser cache writes to 50%. (T320534)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [14:07:39] syncing [14:07:51] (03PS6) 10FNegri: [toolsdb] Update config file but keep old one [puppet] - 10https://gerrit.wikimedia.org/r/896101 (https://phabricator.wikimedia.org/T329970) [14:07:53] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/881422 (https://phabricator.wikimedia.org/T292942) (owner: 10Muehlenhoff) [14:08:02] 10SRE, 10ops-eqiad, 10DC-Ops: Testing Out Hard Drive on Swift Server - https://phabricator.wikimedia.org/T329305 (10MatthewVernon) Yes, please. I've unmounted a drive in ms-be1066 and turned on the locator light `sudo megacli -PDLocate -PhysDrv [32:15] -a0` So please go ahead. 
[14:08:09] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for Sergio Gimeno - https://phabricator.wikimedia.org/T330070 (10SLyngshede-WMF) 05Open→03Resolved a:03SLyngshede-WMF [14:08:36] !log testing disk-swap in ms-be1066 T329305 [14:08:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:41] T329305: Testing Out Hard Drive on Swift Server - https://phabricator.wikimedia.org/T329305 [14:08:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T329260)', diff saved to https://phabricator.wikimedia.org/P45704 and previous config saved to /var/cache/conftool/dbconfig/20230309-140850-marostegui.json [14:08:53] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2155.codfw.wmnet with reason: Maintenance [14:08:55] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [14:09:06] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2155.codfw.wmnet with reason: Maintenance [14:09:07] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [14:09:10] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [14:09:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2155 (T329260)', diff saved to https://phabricator.wikimedia.org/P45705 and previous config saved to /var/cache/conftool/dbconfig/20230309-140915-marostegui.json [14:10:33] (03CR) 10FNegri: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/896101 (https://phabricator.wikimedia.org/T329970) (owner: 10FNegri) [14:11:23] !log jgiannelos@deploy2002 Started deploy [restbase/deploy@f774711]: (no justification provided) [14:12:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on 
k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:13:20] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:886905|Bump parsoid parser cache writes to 50%. (T320534)]] (duration: 07m 28s) [14:13:25] T320534: Put Parsoid output into the ParserCache on every edit - https://phabricator.wikimedia.org/T320534 [14:13:48] duesen: that's now live [14:14:34] (out of curiosity, which dashboard will reflect these changes?) [14:15:34] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10cmooney) >>! In T327919#8679398, @aborrero wrote: > In the past we had problems with DHCP forwarding be... [14:17:05] TheresNoTime: https://grafana-rw.wikimedia.org/d/OxxOv5K4k/ve-backend-dashboard?forceLogin&from=now-1h&orgId=1&refresh=30s&to=now&viewPanel=11 [14:17:30] TheresNoTime: the green area and the grey should eventually be roughly the same size [14:17:52] TheresNoTime: the split was 80/20 before, should be 50/50 now. Looks like it's getting there. [14:17:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:18:44] nice :D [14:19:12] * TheresNoTime will be around for the next 30m if there's any other patches o/ [14:19:25] 10SRE, 10ops-eqiad, 10DC-Ops: Testing Out Hard Drive on Swift Server - https://phabricator.wikimedia.org/T329305 (10MatthewVernon) Something has gone a bit awry, the kernel reports problems with two other drives instead: ` Mar 9 14:13:57 ms-be1066 kernel: [11683056.185701] sd 0:2:4:0: [sdf] tag#699 FAILED R... 
[14:19:51] TheresNoTime: thank you [14:19:53] (03PS7) 10FNegri: [toolsdb] Update config file but keep old one [puppet] - 10https://gerrit.wikimedia.org/r/896101 (https://phabricator.wikimedia.org/T329970) [14:20:24] (03CR) 10FNegri: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/896101 (https://phabricator.wikimedia.org/T329970) (owner: 10FNegri) [14:22:25] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Run 2x1G links from asw-b1-codfw to cloudsw1-b1-codfw - https://phabricator.wikimedia.org/T331470 (10Jhancock.wm) @cmooney I got these repatched as depicted in the links. Thanks for waiting. Please let me know if you need anything else! [14:22:41] (03CR) 10Muehlenhoff: [C: 03+2] Extend dumps alias [puppet] - 10https://gerrit.wikimedia.org/r/895751 (owner: 10Muehlenhoff) [14:23:32] (03CR) 10Muehlenhoff: [C: 03+2] slapd: Add support to configure MDB storage backend [puppet] - 10https://gerrit.wikimedia.org/r/881422 (https://phabricator.wikimedia.org/T292942) (owner: 10Muehlenhoff) [14:29:49] (03PS5) 10Btullis: Update the spark-operator chart with consistent image details [deployment-charts] - 10https://gerrit.wikimedia.org/r/895842 (https://phabricator.wikimedia.org/T318926) [14:30:26] !log jgiannelos@deploy2002 Finished deploy [restbase/deploy@f774711]: (no justification provided) (duration: 19m 03s) [14:30:32] (03PS8) 10FNegri: [toolsdb] Update config file but keep old one [puppet] - 10https://gerrit.wikimedia.org/r/896101 (https://phabricator.wikimedia.org/T329970) [14:30:43] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10cmooney) [14:30:55] 10SRE, 10ops-eqiad, 10DC-Ops: Testing Out Hard Drive on Swift Server - https://phabricator.wikimedia.org/T329305 (10MatthewVernon) Looking at these drives - ` sdz is bus info: scsi@0:2.25.0 Target Id: 25 is Enclosure Device ID: 32 Slot Number: 
23 ` ` sdf is still absent but scsi@0:2.17.0 is missing Target... [14:30:59] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Run 2x1G links from asw-b1-codfw to cloudsw1-b1-codfw - https://phabricator.wikimedia.org/T331470 (10cmooney) 05Open→03Resolved That's great Jenn thanks! All looking good and working now :) ` cmooney@cloudsw1-b1-codfw> show interfaces descrip... [14:31:09] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2105.codfw.wmnet with reason: Maintenance [14:31:23] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2105.codfw.wmnet with reason: Maintenance [14:32:38] (03CR) 10Jelto: [C: 03+1] "lgtm. I resolved on one of my in-line comments after checking the migration of files to the config modules should be noop, because this mo" [puppet] - 10https://gerrit.wikimedia.org/r/895240 (https://phabricator.wikimedia.org/T322369) (owner: 10EoghanGaffney) [14:33:03] (03CR) 10FNegri: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/896101 (https://phabricator.wikimedia.org/T329970) (owner: 10FNegri) [14:33:33] 10SRE, 10ops-eqiad, 10DC-Ops: Testing Out Hard Drive on Swift Server - https://phabricator.wikimedia.org/T329305 (10MatthewVernon) Target Id 4 also missing [14:34:11] (03CR) 10Nicolas Fraison: Configure the new ceph servers with mon and mgr daemons (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) (owner: 10Btullis) [14:34:19] !log installing apr security updates [14:34:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:34] 10SRE, 10ops-eqiad, 10DC-Ops: Testing Out Hard Drive on Swift Server - https://phabricator.wikimedia.org/T329305 (10Jclark-ctr) slot 2 is right by the handle. 
possibly [14:35:59] (03CR) 10Nicolas Fraison: Configure the new ceph servers with mon and mgr daemons (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) (owner: 10Btullis) [14:36:06] (03PS1) 10Daniel Kinzler: Make VE on officewiki use Parsoid directly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896104 [14:39:40] 10SRE, 10ops-eqiad, 10DC-Ops: Testing Out Hard Drive on Swift Server - https://phabricator.wikimedia.org/T329305 (10Jclark-ctr) Replaced drive slot 15 with test drive [14:44:15] 10SRE, 10ops-eqiad, 10DC-Ops: Testing Out Hard Drive on Swift Server - https://phabricator.wikimedia.org/T329305 (10MatthewVernon) Can you check the drives in slots 23 and 2 are seated proper please? the kernel still can't see them. [14:48:57] (03CR) 10Tacsipacsi: [C: 03+1] Drop unused FlaggedRevs threshold level names (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790707 (https://phabricator.wikimedia.org/T277883) (owner: 10Awight) [14:51:20] jouncebot: nowandnext [14:51:20] For the next 0 hour(s) and 8 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230309T1400) [14:51:20] For the next 0 hour(s) and 8 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230309T1400) [14:51:20] In 2 hour(s) and 8 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230309T1700) [14:52:10] (03PS6) 10Zabe: Drop unused FlaggedRevs threshold level names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790707 (https://phabricator.wikimedia.org/T277883) (owner: 10Awight) [14:52:12] (03CR) 10Zabe: [C: 03+2] Drop unused FlaggedRevs threshold level names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790707 (https://phabricator.wikimedia.org/T277883) (owner: 10Awight) [14:52:21] (03PS38) 10David Caro: Modify maintain-dbusers.py to call the rest-api 
service [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [14:52:23] (03CR) 10David Caro: Modify maintain-dbusers.py to call the rest-api service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [14:52:55] (03Merged) 10jenkins-bot: Drop unused FlaggedRevs threshold level names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790707 (https://phabricator.wikimedia.org/T277883) (owner: 10Awight) [14:53:30] (03PS39) 10David Caro: Modify maintain-dbusers.py to call the rest-api service [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [14:54:02] !log zabe@deploy2002 Started scap: Backport for [[gerrit:790707|Drop unused FlaggedRevs threshold level names (T277883)]] [14:54:08] T277883: Drop all low-use and unused features of FlaggedRevs to make it more maintainable - https://phabricator.wikimedia.org/T277883 [14:54:16] (03PS6) 10David Caro: maintain-dbusers: add nicer logging with dry run prefix [puppet] - 10https://gerrit.wikimedia.org/r/895756 (https://phabricator.wikimedia.org/T303663) [14:54:27] (03PS2) 10David Caro: maintain-dbusers: skip tool accounts that are not ready [puppet] - 10https://gerrit.wikimedia.org/r/895838 (https://phabricator.wikimedia.org/T303663) [14:55:03] (03CR) 10Andrew Bogott: [C: 03+1] [toolsdb] Update config file but keep old one [puppet] - 10https://gerrit.wikimedia.org/r/896101 (https://phabricator.wikimedia.org/T329970) (owner: 10FNegri) [14:55:24] (03CR) 10CI reject: [V: 04-1] Modify maintain-dbusers.py to call the rest-api service [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [14:55:41] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/thumbor: apply [14:55:52] !log zabe@deploy2002 awight and zabe: Backport for 
[[gerrit:790707|Drop unused FlaggedRevs threshold level names (T277883)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [14:56:00] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [14:56:17] (03CR) 10CI reject: [V: 04-1] maintain-dbusers: add nicer logging with dry run prefix [puppet] - 10https://gerrit.wikimedia.org/r/895756 (https://phabricator.wikimedia.org/T303663) (owner: 10David Caro) [14:57:04] (03CR) 10CI reject: [V: 04-1] maintain-dbusers: skip tool accounts that are not ready [puppet] - 10https://gerrit.wikimedia.org/r/895838 (https://phabricator.wikimedia.org/T303663) (owner: 10David Caro) [14:58:01] (03CR) 10BBlack: [C: 03+1] dns::auth: deprecate role and update site.pp [puppet] - 10https://gerrit.wikimedia.org/r/895894 (https://phabricator.wikimedia.org/T330670) (owner: 10Ssingh) [14:58:26] (03PS1) 10Slyngshede: R:idp_test create development service [puppet] - 10https://gerrit.wikimedia.org/r/896109 [15:00:05] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/thumbor: apply [15:00:09] (03PS40) 10David Caro: Modify maintain-dbusers.py to call the rest-api service [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [15:00:12] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [15:00:56] (03CR) 10Tacsipacsi: Drop unused FlaggedRevs threshold level names (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790707 (https://phabricator.wikimedia.org/T277883) (owner: 10Awight) [15:01:18] (03CR) 10Slyngshede: "Do you see any security implications of having a service that allows callback to be directed to localhost? 
It would be really helpful to j" [puppet] - 10https://gerrit.wikimedia.org/r/896109 (owner: 10Slyngshede) [15:01:40] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/thumbor: apply [15:01:43] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [15:02:00] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db[2135,2160].codfw.wmnet,db[1117,1176,1183].eqiad.wmnet with reason: m5 master switch T330847 [15:02:04] (03CR) 10CI reject: [V: 04-1] Modify maintain-dbusers.py to call the rest-api service [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [15:02:06] T330847: Switchover m5 master (db1183 -> db1176) - https://phabricator.wikimedia.org/T330847 [15:02:16] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db[2135,2160].codfw.wmnet,db[1117,1176,1183].eqiad.wmnet with reason: m5 master switch T330847 [15:02:47] (03PS1) 10Muehlenhoff: slapd: correct module loading [puppet] - 10https://gerrit.wikimedia.org/r/896110 (https://phabricator.wikimedia.org/T292942) [15:02:51] 10SRE, 10DBA, 10Toolhub, 10Wikimedia-Mailing-lists, and 2 others: Switchover m5 master (db1183 -> db1176) - https://phabricator.wikimedia.org/T330847 (10Marostegui) [15:03:10] (ThanosQueryHttpRequestQueryRangeErrorRateHigh) firing: Thanos Query is failing to handle requests. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryHttpRequestQueryRangeErrorRateHigh [15:03:38] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db1176 to m5 master [puppet] - 10https://gerrit.wikimedia.org/r/895910 (https://phabricator.wikimedia.org/T330847) (owner: 10Marostegui) [15:03:42] (03PS2) 10Marostegui: mariadb: Promote db1176 to m5 master [puppet] - 10https://gerrit.wikimedia.org/r/895910 (https://phabricator.wikimedia.org/T330847) [15:04:08] !log close UTC afternoon backport window [15:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:14] 10SRE, 10DBA, 10Toolhub, 10Wikimedia-Mailing-lists, and 2 others: Switchover m5 master (db1183 -> db1176) - https://phabricator.wikimedia.org/T330847 (10Marostegui) [15:04:43] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/896110 (https://phabricator.wikimedia.org/T292942) (owner: 10Muehlenhoff) [15:04:45] 10SRE, 10ops-codfw, 10Data-Persistence (work done), 10decommission-hardware: decommission db2093.codfw.wmnet - https://phabricator.wikimedia.org/T330827 (10Jhancock.wm) 05Open→03Resolved [15:04:51] !log zabe@deploy2002 Finished scap: Backport for [[gerrit:790707|Drop unused FlaggedRevs threshold level names (T277883)]] (duration: 10m 48s) [15:04:55] T277883: Drop all low-use and unused features of FlaggedRevs to make it more maintainable - https://phabricator.wikimedia.org/T277883 [15:05:10] (03CR) 10Herron: [C: 03+1] centrallog: Add centrallog1002 as the kafkatee active host [puppet] - 10https://gerrit.wikimedia.org/r/895902 (https://phabricator.wikimedia.org/T328803) (owner: 10Andrea Denisse) [15:05:33] 10SRE, 10DBA, 10Toolhub, 10Wikimedia-Mailing-lists, and 2 others: Switchover m5 master (db1183 -> db1176) - https://phabricator.wikimedia.org/T330847 (10Marostegui) [15:06:53] !log Disable puppet on 
R:acme_chief::cert for acmechief maintenance - T321309 [15:06:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:57] T321309: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 [15:07:25] (03CR) 10Herron: [C: 03+1] "LGTM pending fix for commit msg typo flagged by filippo" [puppet] - 10https://gerrit.wikimedia.org/r/895898 (https://phabricator.wikimedia.org/T328803) (owner: 10Andrea Denisse) [15:07:54] (03CR) 10Herron: [C: 03+1] rsyslog: Remove centrallog1001 as TLS rsyslog destination [puppet] - 10https://gerrit.wikimedia.org/r/890884 (https://phabricator.wikimedia.org/T328803) (owner: 10Andrea Denisse) [15:08:10] (ThanosQueryHttpRequestQueryRangeErrorRateHigh) resolved: Thanos Query is failing to handle requests. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryHttpRequestQueryRangeErrorRateHigh [15:08:17] (03CR) 10Vgutierrez: [C: 03+1] acmechief: Set acmechief2001 as active [puppet] - 10https://gerrit.wikimedia.org/r/895860 (https://phabricator.wikimedia.org/T321309) (owner: 10BCornwall) [15:08:27] (03CR) 10BCornwall: [C: 03+2] acmechief: Set acmechief2001 as active [puppet] - 10https://gerrit.wikimedia.org/r/895860 (https://phabricator.wikimedia.org/T321309) (owner: 10BCornwall) [15:09:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T329260)', diff saved to https://phabricator.wikimedia.org/P45706 and previous config saved to /var/cache/conftool/dbconfig/20230309-150940-marostegui.json [15:09:46] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [15:10:02] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/thumbor: apply [15:10:04] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [15:10:08] PROBLEM - Disk space on 
urldownloader2001 is CRITICAL: DISK CRITICAL - free space: / 332 MB (3% inode=81%): /tmp 332 MB (3% inode=81%): /var/tmp 332 MB (3% inode=81%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=urldownloader2001&var-datasource=codfw+prometheus/ops [15:10:36] (03PS1) 10JMeybohm: cert-manager: Enable stable certificate request names in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/896111 (https://phabricator.wikimedia.org/T304092) [15:10:41] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2163.codfw.wmnet with reason: Maintenance [15:10:54] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2163.codfw.wmnet with reason: Maintenance [15:10:58] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:11:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2163 (T329203)', diff saved to https://phabricator.wikimedia.org/P45707 and previous config saved to /var/cache/conftool/dbconfig/20230309-151100-marostegui.json [15:11:05] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [15:11:35] (03PS2) 10JMeybohm: admin_ng: Add default-network-policy globally [deployment-charts] - 10https://gerrit.wikimedia.org/r/893018 (https://phabricator.wikimedia.org/T275035) [15:11:40] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/thumbor: apply [15:11:43] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [15:12:03] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [15:13:13] (03CR) 10FNegri: [C: 03+2] [toolsdb] Update config 
file but keep old one [puppet] - 10https://gerrit.wikimedia.org/r/896101 (https://phabricator.wikimedia.org/T329970) (owner: 10FNegri) [15:13:53] 10SRE, 10ops-eqiad, 10DC-Ops: Testing Out Hard Drive on Swift Server - https://phabricator.wikimedia.org/T329305 (10MatthewVernon) [after a reboot the drive in slot 2 was in a "Foreign" state; clearing that made it possible to reintroduce it with `sudo megacli -CfgEachDskRaid0 WB RA Direct CachedBadBBU -a0`... [15:13:57] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/thumbor: apply [15:14:03] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [15:14:04] PROBLEM - Check systemd state on acmechief1001 is CRITICAL: CRITICAL - degraded: The following units failed: reload-acme-chief-backend.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:14:34] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS records for codfw cr links to cloudsw-b1-codfw. - cmooney@cumin1001" [15:14:47] acmechief1001 alert is expected [15:15:04] !log installing PHP 7.3 security updates (as shipped in Debian) [15:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:41] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS records for codfw cr links to cloudsw-b1-codfw. 
- cmooney@cumin1001" [15:15:41] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:15:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:16:31] 10SRE, 10DBA, 10Toolhub, 10Wikimedia-Mailing-lists, 10cloud-services-team: Switchover m5 master (db1183 -> db1176) - https://phabricator.wikimedia.org/T330847 (10Marostegui) [15:16:55] 10SRE, 10DBA, 10Toolhub, 10Wikimedia-Mailing-lists, 10cloud-services-team: Switchover m5 master (db1183 -> db1176) - https://phabricator.wikimedia.org/T330847 (10Marostegui) All the pre-failover steps are done. Waiting for 16:00 UTC to perform the actual switch. [15:17:51] (03PS1) 10Muehlenhoff: Add Cumin aliases for IDM [puppet] - 10https://gerrit.wikimedia.org/r/896112 (https://phabricator.wikimedia.org/T320797) [15:19:03] (03PS11) 10Bking: rdf-streaming-updater: add a test job using the k8s operator... 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/886005 (https://phabricator.wikimedia.org/T328675) (owner: 10DCausse) [15:19:38] (03PS1) 10BCornwall: hieradata/common: acmechief_host as acmechief2001 [puppet] - 10https://gerrit.wikimedia.org/r/896114 (https://phabricator.wikimedia.org/T321309) [15:20:31] (03CR) 10Vgutierrez: [C: 03+1] hieradata/common: acmechief_host as acmechief2001 [puppet] - 10https://gerrit.wikimedia.org/r/896114 (https://phabricator.wikimedia.org/T321309) (owner: 10BCornwall) [15:20:37] (03CR) 10BCornwall: [C: 03+2] hieradata/common: acmechief_host as acmechief2001 [puppet] - 10https://gerrit.wikimedia.org/r/896114 (https://phabricator.wikimedia.org/T321309) (owner: 10BCornwall) [15:21:34] (03PS41) 10David Caro: Modify maintain-dbusers.py to call the rest-api service [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [15:21:45] (03CR) 10DCausse: [C: 03+1] rdf-streaming-updater: add a test job using the k8s operator... 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/886005 (https://phabricator.wikimedia.org/T328675) (owner: 10DCausse) [15:23:22] (03PS1) 10David Caro: replica_cnf: return skip if the account already exists [puppet] - 10https://gerrit.wikimedia.org/r/896115 [15:23:28] (03CR) 10CI reject: [V: 04-1] Modify maintain-dbusers.py to call the rest-api service [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [15:24:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P45709 and previous config saved to /var/cache/conftool/dbconfig/20230309-152447-marostegui.json [15:25:35] (03CR) 10JMeybohm: Exclude traindev from tests (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/888227 (owner: 10Clément Goubert) [15:26:14] (03PS1) 10Nicolas Fraison: osd: Add osd on new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/896116 [15:26:16] (03PS1) 10Nicolas Fraison: osd: create osd [puppet] - 10https://gerrit.wikimedia.org/r/896117 [15:26:28] (03CR) 10Nicolas Fraison: [C: 04-2] "WIP" [puppet] - 10https://gerrit.wikimedia.org/r/896116 (owner: 10Nicolas Fraison) [15:26:32] (03CR) 10Nicolas Fraison: [C: 04-2] "WIP" [puppet] - 10https://gerrit.wikimedia.org/r/896117 (owner: 10Nicolas Fraison) [15:26:38] (03CR) 10CI reject: [V: 04-1] osd: Add osd on new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/896116 (owner: 10Nicolas Fraison) [15:26:45] (03CR) 10CI reject: [V: 04-1] osd: create osd [puppet] - 10https://gerrit.wikimedia.org/r/896117 (owner: 10Nicolas Fraison) [15:26:59] (03PS1) 10Zabe: switch noc.wikimedia.org from eqiad to codfw [dns] - 10https://gerrit.wikimedia.org/r/896118 (https://phabricator.wikimedia.org/T331634) [15:27:12] (03CR) 10Subramanya Sastry: [C: 04-1] "Let us first debug the etag breakage before we make this change. 
We don't want this to hide a bug only to resurface when we disable RESTBa" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896104 (owner: 10Daniel Kinzler) [15:27:51] !log Enable puppet on R:acme_chief::cert - T321309 [15:27:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:56] T321309: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 [15:28:51] 10SRE, 10ops-eqiad, 10DC-Ops: Testing Out Hard Drive on Swift Server - https://phabricator.wikimedia.org/T329305 (10MatthewVernon) The swapped-in drive seems OK initially, I'll get swift to start using it shortly. [15:29:00] (03CR) 10Bking: [C: 03+2] rdf-streaming-updater: add a test job using the k8s operator... [deployment-charts] - 10https://gerrit.wikimedia.org/r/886005 (https://phabricator.wikimedia.org/T328675) (owner: 10DCausse) [15:29:05] !log brett@cumin2002 START - Cookbook sre.ganeti.reimage for host acmechief1001.eqiad.wmnet with OS bullseye [15:29:15] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by brett@cumin2002 for host acmechief1001.eqiad.wmnet with OS bullseye [15:30:11] RECOVERY - Disk space on urldownloader2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=urldownloader2001&var-datasource=codfw+prometheus/ops [15:30:48] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM. Rule makes sense." [puppet] - 10https://gerrit.wikimedia.org/r/896052 (https://phabricator.wikimedia.org/T272585) (owner: 10Arturo Borrero Gonzalez) [15:30:57] (03CR) 10JMeybohm: [C: 04-1] "Do you mind removing the k8s version conditionals again? 
All clusters are on 1.23 and as of I77657a2674a4546aa5088660745f09eedd5d2201" [deployment-charts] - 10https://gerrit.wikimedia.org/r/887945 (https://phabricator.wikimedia.org/T306649) (owner: 10Ayounsi) [15:31:01] (03PS2) 10Nicolas Fraison: osd: Add osd on new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) [15:31:03] (03PS2) 10Nicolas Fraison: osd: create osd [puppet] - 10https://gerrit.wikimedia.org/r/896117 (https://phabricator.wikimedia.org/T330151) [15:31:33] (03CR) 10CI reject: [V: 04-1] osd: create osd [puppet] - 10https://gerrit.wikimedia.org/r/896117 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison) [15:32:36] (03PS32) 10Nicolas Fraison: Configure the new ceph servers with mon and mgr daemons [puppet] - 10https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) (owner: 10Btullis) [15:32:38] (03PS3) 10Nicolas Fraison: osd: Add osd on new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) [15:32:40] (03PS3) 10Nicolas Fraison: osd: create osd [puppet] - 10https://gerrit.wikimedia.org/r/896117 (https://phabricator.wikimedia.org/T330151) [15:32:42] (03PS42) 10David Caro: Modify maintain-dbusers.py to call the rest-api service [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [15:33:06] (03CR) 10CI reject: [V: 04-1] osd: Add osd on new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison) [15:33:13] (03CR) 10Nicolas Fraison: Configure the new ceph servers with mon and mgr daemons (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) (owner: 10Btullis) [15:33:19] (03CR) 10Nicolas Fraison: Configure the new ceph servers with mon and mgr daemons (032 comments) [puppet] - 
10https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) (owner: 10Btullis) [15:33:22] (03CR) 10CI reject: [V: 04-1] osd: create osd [puppet] - 10https://gerrit.wikimedia.org/r/896117 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison) [15:34:36] (03Merged) 10jenkins-bot: rdf-streaming-updater: add a test job using the k8s operator... [deployment-charts] - 10https://gerrit.wikimedia.org/r/886005 (https://phabricator.wikimedia.org/T328675) (owner: 10DCausse) [15:34:39] PROBLEM - Check systemd state on acmechief2001 is CRITICAL: CRITICAL - degraded: The following units failed: acme-chief-certs-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:35:38] (03CR) 10CI reject: [V: 04-1] osd: Add osd on new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison) [15:35:52] ^^ expected while acmechief1001 is being reimaged [15:36:52] (03CR) 10JMeybohm: "I'd say +1 but this needs rebase after the 1.23 upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/889069 (https://phabricator.wikimedia.org/T306649) (owner: 10Ayounsi) [15:36:56] (03PS2) 10Andrea Denisse: centrallog: Remove centrallog1001 from the kafka-jumbo allow list [puppet] - 10https://gerrit.wikimedia.org/r/895898 (https://phabricator.wikimedia.org/T328803) [15:37:24] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10cmooney) Some updates on the physicals for the new cloudsw. The links to core routers are now up and c... [15:39:07] In 20 minutes I am switching over m5 db master, which will affect toolhub, mailman and some other WMCS related databases. 
Impact: RO for around 1 minute, reads unaffected https://phabricator.wikimedia.org/T330847 [15:39:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P45710 and previous config saved to /var/cache/conftool/dbconfig/20230309-153953-marostegui.json [15:40:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2163 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P45711 and previous config saved to /var/cache/conftool/dbconfig/20230309-154053-root.json [15:44:41] (03CR) 10Nray: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893542 (https://phabricator.wikimedia.org/T326829) (owner: 10Nray) [15:44:48] (03CR) 10David Caro: [C: 03+2] replica_cnf: return skip if the account already exists [puppet] - 10https://gerrit.wikimedia.org/r/896115 (owner: 10David Caro) [15:44:58] (03CR) 10Herron: "thanks! the updates lgtm in general, please see comments inline" [puppet] - 10https://gerrit.wikimedia.org/r/894646 (owner: 10Jbond) [15:46:15] (03CR) 10BCornwall: codesearch: Change systemd Requires= to BindsTo= (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/895884 (https://phabricator.wikimedia.org/T284555) (owner: 10BCornwall) [15:54:04] 10SRE, 10Gerrit, 10Infrastructure-Foundations, 10Mail, 10Wikimedia-Mailing-lists: reviewer-bot is not working - https://phabricator.wikimedia.org/T331626 (10hashar) I am guessing it is an issue with Mailman. https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3 shows a large queue **since March 7th 14:12**:... 
[15:55:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T329260)', diff saved to https://phabricator.wikimedia.org/P45712 and previous config saved to /var/cache/conftool/dbconfig/20230309-155459-marostegui.json [15:55:02] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2172.codfw.wmnet with reason: Maintenance [15:55:06] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [15:55:15] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2172.codfw.wmnet with reason: Maintenance [15:55:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2172 (T329260)', diff saved to https://phabricator.wikimedia.org/P45713 and previous config saved to /var/cache/conftool/dbconfig/20230309-155520-marostegui.json [15:55:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2163 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P45714 and previous config saved to /var/cache/conftool/dbconfig/20230309-155558-root.json [15:56:18] (03CR) 10Giuseppe Lavagetto: Add check_dns_state to service.Service (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/894655 (owner: 10Giuseppe Lavagetto) [15:56:25] 10SRE, 10ops-eqsin, 10ops-ulsfo, 10DC-Ops: eqsin & ulsfo: new R450s drawing far more power than R440s (power over contracted caps in both sites) - https://phabricator.wikimedia.org/T328957 (10RobH) We chatted about this during the last knams sync up call, as our racks there have a higher cap due to this.... 
[15:57:08] (03PS7) 10Giuseppe Lavagetto: Add check_dns_state to service.Service [software/spicerack] - 10https://gerrit.wikimedia.org/r/894655 [15:57:20] 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops: Netbox Juniper report - https://phabricator.wikimedia.org/T306238 (10jbond) [15:57:30] 10SRE, 10Infrastructure-Foundations, 10CAS-SSO: Upgrade IDPs to CAS 6.6/Bullseye and enable webauthn - https://phabricator.wikimedia.org/T305518 (10jbond) [15:57:36] 10SRE, 10Infrastructure-Foundations, 10CAS-SSO: Enable OIDC in CAS - https://phabricator.wikimedia.org/T311999 (10jbond) 05Open→03Resolved a:03jbond [15:58:01] (03CR) 10Ahmon Dancy: [C: 03+1] gitlab_runner: add optional docker registry proxy to runners (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/894100 (https://phabricator.wikimedia.org/T329679) (owner: 10Jelto) [15:59:47] (03PS1) 10Zabe: noc: Publicly expose EventStreamConfig settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896121 (https://phabricator.wikimedia.org/T308932) [15:59:52] (03CR) 10Nicolas Fraison: [C: 03+1] Update the spark-operator chart with consistent image details [deployment-charts] - 10https://gerrit.wikimedia.org/r/895842 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [15:59:57] (03CR) 10Zabe: [C: 03+2] noc: Publicly expose EventStreamConfig settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896121 (https://phabricator.wikimedia.org/T308932) (owner: 10Zabe) [16:00:09] !log Failover m5 from db1183 to db1176 - T330847 [16:00:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:16] RECOVERY - Check systemd state on acmechief2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:00:17] T330847: Switchover m5 master (db1183 -> db1176) - https://phabricator.wikimedia.org/T330847 [16:00:33] bd808: all done [16:00:47] (03Merged) 10jenkins-bot: noc: Publicly expose EventStreamConfig 
settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896121 (https://phabricator.wikimedia.org/T308932) (owner: 10Zabe) [16:01:00] Around 15 seconds RO [16:01:15] Brutal ;) [16:01:32] striker is working as expected. [16:01:45] (03CR) 10Andrea Denisse: [V: 03+1 C: 03+2] centrallog: Add centrallog1002 as the kafkatee active host [puppet] - 10https://gerrit.wikimedia.org/r/895902 (https://phabricator.wikimedia.org/T328803) (owner: 10Andrea Denisse) [16:01:46] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on acmechief1001.eqiad.wmnet with reason: host reimage [16:01:58] toolhub looks good too [16:02:20] !log zabe@deploy2002 Started scap: T308932 [16:02:25] T308932: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022) - https://phabricator.wikimedia.org/T308932 [16:02:29] bd808: including writes? [16:02:45] 10SRE, 10DBA, 10Toolhub, 10Wikimedia-Mailing-lists, 10cloud-services-team: Switchover m5 master (db1183 -> db1176) - https://phabricator.wikimedia.org/T330847 (10Marostegui) [16:02:53] (03Abandoned) 10David Caro: maintain-dbusers: skip tool accounts that are not ready [puppet] - 10https://gerrit.wikimedia.org/r/895838 (https://phabricator.wikimedia.org/T303663) (owner: 10David Caro) [16:02:59] !log Restart mailman service T331626 [16:03:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:04] T331626: reviewer-bot is not working - https://phabricator.wikimedia.org/T331626 [16:03:05] (03CR) 10Andrea Denisse: [V: 03+1 C: 03+2] rsyslog: Remove centrallog1001 as TLS rsyslog destination [puppet] - 10https://gerrit.wikimedia.org/r/890884 (https://phabricator.wikimedia.org/T328803) (owner: 10Andrea Denisse) [16:03:07] marostegui: yes, on both [16:03:16] bd808: \o( [16:03:21] PROBLEM - Check systemd state on acmechief2001 is CRITICAL: CRITICAL - degraded: The following units failed: acme-chief-certs-sync.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:03:22] (03CR) 10Andrea Denisse: [C: 03+2] centrallog: Remove centrallog1001 from the kafka-jumbo allow list [puppet] - 10https://gerrit.wikimedia.org/r/895898 (https://phabricator.wikimedia.org/T328803) (owner: 10Andrea Denisse) [16:03:25] bd808: we are done then! [16:03:28] thanks for being around [16:03:47] thank you for doing the needful [16:04:42] 10SRE, 10Gerrit, 10Infrastructure-Foundations, 10Mail, 10Wikimedia-Mailing-lists: reviewer-bot is not working - https://phabricator.wikimedia.org/T331626 (10hashar) Icinga says OK: mailman3 queues are below the limits, but there is an alert about the runners: PROCS CRITICAL: 13 processes with UID = 38 (... [16:04:58] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on acmechief1001.eqiad.wmnet with reason: host reimage [16:05:18] (03PS1) 10MVernon: swift: bring ms-be1066 sdr1 back into service [puppet] - 10https://gerrit.wikimedia.org/r/896124 (https://phabricator.wikimedia.org/T329305) [16:06:12] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: Testing Out Hard Drive on Swift Server - https://phabricator.wikimedia.org/T329305 (10MatthewVernon) [16:06:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T329260)', diff saved to https://phabricator.wikimedia.org/P45715 and previous config saved to /var/cache/conftool/dbconfig/20230309-160630-marostegui.json [16:06:36] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [16:06:51] (03PS2) 10MVernon: admin: update sbassett ssh key [puppet] - 10https://gerrit.wikimedia.org/r/896024 (https://phabricator.wikimedia.org/T331554) [16:09:39] !log zabe@deploy2002 Finished scap: T308932 (duration: 07m 19s) [16:09:44] T308932: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022) - https://phabricator.wikimedia.org/T308932 [16:09:47] 
10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10Marostegui) [16:10:05] 10SRE, 10DBA, 10Toolhub, 10Wikimedia-Mailing-lists, 10cloud-services-team: Switchover m5 master (db1183 -> db1176) - https://phabricator.wikimedia.org/T330847 (10Marostegui) [16:10:19] 10SRE, 10SRE-swift-storage, 10Data-Persistence, 10Thumbor Migration, 10Platform Team Workboards (Platform Engineering Reliability): Pooling thumbor-k8s causes spikes in swift 500 errors - https://phabricator.wikimedia.org/T328033 (10hnowlan) After pooling again and looking into the Swift logs, we realise... [16:10:27] 10SRE, 10DBA, 10Toolhub, 10Wikimedia-Mailing-lists, 10cloud-services-team: Switchover m5 master (db1183 -> db1176) - https://phabricator.wikimedia.org/T330847 (10Marostegui) 05Open→03Resolved This was done, the RO time was around 15 seconds. Thanks @bd808 for the support! [16:10:41] (03PS7) 10David Caro: maintain-dbusers: add nicer logging with dry run prefix [puppet] - 10https://gerrit.wikimedia.org/r/895756 (https://phabricator.wikimedia.org/T303663) [16:11:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2163 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P45716 and previous config saved to /var/cache/conftool/dbconfig/20230309-161103-root.json [16:15:30] 10SRE, 10Gerrit, 10Infrastructure-Foundations, 10Mail, 10Wikimedia-Mailing-lists: reviewer-bot is not working - https://phabricator.wikimedia.org/T331626 (10Marostegui) It looks like the restart I made fixed it or at least it is slowly going down: https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3?orgId=... [16:16:13] 10SRE, 10Gerrit, 10Infrastructure-Foundations, 10Mail, 10Wikimedia-Mailing-lists: reviewer-bot is not working - https://phabricator.wikimedia.org/T331626 (10hashar) 05Open→03Resolved a:03hashar Mail should be emitted again, it will take a bit of time to clear the queue though. 
That can be monitored... [16:16:21] 10SRE, 10Wikimedia-Mailing-lists: Not receiving posts or moderation messages - https://phabricator.wikimedia.org/T331633 (10Aklapper) p:05Triage→03Unbreak! Potential regression from {T329073}, similar to {T331626} [16:18:18] RECOVERY - Check systemd state on acmechief2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:18:32] (03PS1) 10JMeybohm: Revert: cert-manager: Disable seccomProfile for k8s 1.16 compatibility [deployment-charts] - 10https://gerrit.wikimedia.org/r/896128 (https://phabricator.wikimedia.org/T325292) [16:18:34] !log brett@cumin2002 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host acmechief1001.eqiad.wmnet with OS bullseye [16:18:43] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by brett@cumin2002 for host acmechief1001.eqiad.wmnet with OS bullseye completed: - acmechief1001 (**PASS**) - Downtimed on Icinga/Alertmanager... 
[16:21:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P45717 and previous config saved to /var/cache/conftool/dbconfig/20230309-162137-marostegui.json [16:23:32] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [16:24:09] 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10Marostegui) [16:25:00] (03PS1) 10JMeybohm: Migrate away from deprecated typology annotations [deployment-charts] - 10https://gerrit.wikimedia.org/r/896130 (https://phabricator.wikimedia.org/T325066) [16:25:08] (03CR) 10Dzahn: "Could you please coordinate with serviceops on this one" [dns] - 10https://gerrit.wikimedia.org/r/896118 (https://phabricator.wikimedia.org/T331634) (owner: 10Zabe) [16:26:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2163 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P45718 and previous config saved to /var/cache/conftool/dbconfig/20230309-162608-root.json [16:26:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2165.codfw.wmnet with reason: Maintenance [16:27:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2165.codfw.wmnet with reason: Maintenance [16:27:30] (03CR) 10Dzahn: "I think https://phabricator.wikimedia.org/project/members/3158/ might be a better match than git blame in this case." 
[puppet] - 10https://gerrit.wikimedia.org/r/895884 (https://phabricator.wikimedia.org/T284555) (owner: 10BCornwall) [16:28:32] (03CR) 10Btullis: [C: 03+2] Update the spark-operator chart with consistent image details [deployment-charts] - 10https://gerrit.wikimedia.org/r/895842 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [16:28:56] RECOVERY - mailman3_runners on lists1001 is OK: PROCS OK: 14 processes with UID = 38 (list), regex args /usr/lib/mailman3/bin/runner https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:29:39] (03CR) 10Btullis: [C: 03+2] Update the spark-operator chart with consistent image details (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/895842 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [16:31:13] 10SRE, 10Wikimedia-Mailing-lists: Not receiving posts or moderation messages - https://phabricator.wikimedia.org/T331633 (10Marostegui) Probably because of T331626 which is already fixed and recovering. It will take a bit until the queue gets emptied but the trend looks good: https://grafana.wikimedia.org/d/Gv... [16:32:07] 10SRE, 10Wikimedia-Mailing-lists: Not receiving posts or moderation messages - https://phabricator.wikimedia.org/T331633 (10Marostegui) For the record: looks like the restart fixed it (T331626#8680413) [16:33:39] (03Merged) 10jenkins-bot: Update the spark-operator chart with consistent image details [deployment-charts] - 10https://gerrit.wikimedia.org/r/895842 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [16:36:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P45719 and previous config saved to /var/cache/conftool/dbconfig/20230309-163643-marostegui.json [16:36:48] marostegui: re mailman, will old messages and moderation notices be relayed to the recipients or are they lost forever?
[16:37:46] herzog: they are being very slowly delivered [16:37:54] (03PS1) 10JMeybohm: custom_deploy.d: Make k8s 1.23 istio configs the default [deployment-charts] - 10https://gerrit.wikimedia.org/r/896131 (https://phabricator.wikimedia.org/T328291) [16:40:04] herzog: they should arrive when the queue gets processed [16:40:30] thanks marostegui & RhinosF1 [16:42:28] (03CR) 10JHathaway: [C: 03+1] "looks good to me!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/896130 (https://phabricator.wikimedia.org/T325066) (owner: 10JMeybohm) [16:42:55] 10SRE, 10Gerrit, 10Infrastructure-Foundations, 10Mail, 10Wikimedia-Mailing-lists: reviewer-bot is not working - https://phabricator.wikimedia.org/T331626 (10hashar) >>! In T331626#8680354, @hashar wrote: > PROCS CRITICAL: 13 processes with UID = 38 (list), regex args '/usr/lib/mailman3/bin/runner' > Last... [16:43:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (10Cmjohnson) I received an idrac error on 3 of these hosts, I confirmed with Jeff that he is not able to access the host. I am going to try and update t... [16:47:59] (03PS1) 10Subramanya Sastry: Revert "TransformHandler: Load stashed page bundle based on ETag." [core] (wmf/1.40.0-wmf.26) - 10https://gerrit.wikimedia.org/r/896030 (https://phabricator.wikimedia.org/T331629) [16:49:02] 10SRE, 10Wikimedia-Mailing-lists: Not receiving posts or moderation messages - https://phabricator.wikimedia.org/T331633 (10MatthewVernon) p:05Unbreak!→03Medium [16:49:16] (03PS1) 10Cwhite: logstash: mediawiki_ecs copy http_method into place [puppet] - 10https://gerrit.wikimedia.org/r/895739 [16:49:27] 10SRE, 10Wikimedia-Mailing-lists: Not receiving posts or moderation messages - https://phabricator.wikimedia.org/T331633 (10MatthewVernon) Setting to medium priority, because this is probably now just a case of waiting for the queue to drain. 
[16:50:21] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [16:51:46] !log Add EBGP peering from cr1-codfw to cloudsw1-b1-codfw (cloud vrf) T327919 [16:51:49] 10SRE, 10ops-eqiad: anworker1132 BBU issue/replacement - https://phabricator.wikimedia.org/T331543 (10Cmjohnson) @RhinosF1 Do I still need to troubleshoot the BBU or is no longer needed? [16:51:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T329260)', diff saved to https://phabricator.wikimedia.org/P45720 and previous config saved to /var/cache/conftool/dbconfig/20230309-165149-marostegui.json [16:51:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:51] T327919: Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 [16:51:51] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2179.codfw.wmnet with reason: Maintenance [16:51:55] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [16:51:58] (03PS1) 10JMeybohm: Move default kubernetes version to 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/896134 (https://phabricator.wikimedia.org/T328291) [16:52:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2179.codfw.wmnet with reason: Maintenance [16:52:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2179 (T329260)', diff saved to https://phabricator.wikimedia.org/P45721 and previous config saved to /var/cache/conftool/dbconfig/20230309-165210-marostegui.json [16:52:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd10(3[5-9]|40) - https://phabricator.wikimedia.org/T324998 (10Cmjohnson) Failed install but I didn't change the raid controller. 
[16:52:53] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd10(3[5-9]|40) - https://phabricator.wikimedia.org/T324998 (10Cmjohnson) a:05Jclark-ctr→03Cmjohnson [16:55:01] (03PS1) 10BryanDavis: developer-portal: Bump container version to 2023-03-06-121941-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/896135 [16:55:02] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [16:56:01] jouncebot: nowandnext [16:56:01] No deployments scheduled for the next 0 hour(s) and 3 minute(s) [16:56:01] In 0 hour(s) and 3 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230309T1700) [16:56:13] (03CR) 10Zabe: [C: 03+2] Revert "TransformHandler: Load stashed page bundle based on ETag." [core] (wmf/1.40.0-wmf.26) - 10https://gerrit.wikimedia.org/r/896030 (https://phabricator.wikimedia.org/T331629) (owner: 10Subramanya Sastry) [16:56:37] (03PS1) 10Btullis: Remove the install-crds parameter frlom spark-operator [deployment-charts] - 10https://gerrit.wikimedia.org/r/896137 (https://phabricator.wikimedia.org/T315486) [16:56:40] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [16:58:42] (03CR) 10Nicolas Fraison: [C: 03+1] Remove the install-crds parameter frlom spark-operator [deployment-charts] - 10https://gerrit.wikimedia.org/r/896137 (https://phabricator.wikimedia.org/T315486) (owner: 10Btullis) [16:58:48] (03CR) 10Andrew Bogott: [C: 03+2] striker: Bump container version to 2023-03-09-005633-production [puppet] - 10https://gerrit.wikimedia.org/r/895892 (https://phabricator.wikimedia.org/T330421) (owner: 10BryanDavis) [16:58:50] (03PS43) 10David Caro: Modify maintain-dbusers.py to call the rest-api service [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 
10Raymond Ndibe) [17:00:04] jbond and rzl: Your horoscope predicts another unfortunate Puppet request window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230309T1700). [17:00:04] No Gerrit patches in the queue for this window AFAICS. [17:00:18] (03PS2) 10JMeybohm: Move default kubernetes version to 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/896134 (https://phabricator.wikimedia.org/T328291) [17:00:25] seeing intermittent phabricator issues (`Unable to establish a connection to any database host (while trying "phabricator_spaces"). All masters and replicas are completely unreachable. AphrontConnectionLostQueryException: #2006: MySQL server has gone away`) [17:00:58] TheresNoTime: let me check the DBs [17:01:35] (03CR) 10Btullis: [C: 03+2] Remove the install-crds parameter frlom spark-operator [deployment-charts] - 10https://gerrit.wikimedia.org/r/896137 (https://phabricator.wikimedia.org/T315486) (owner: 10Btullis) [17:02:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T329260)', diff saved to https://phabricator.wikimedia.org/P45722 and previous config saved to /var/cache/conftool/dbconfig/20230309-170205-marostegui.json [17:02:11] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [17:02:20] (fwiw, twice in ~10 minutes, persisted a few minutes each time) [17:02:23] TheresNoTime: Everything seems to be working fine [17:02:33] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 12): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40066/console" [puppet] - 10https://gerrit.wikimedia.org/r/896134 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [17:02:46] thank you for looking :) [17:03:12] 10SRE, 10ops-eqiad: anworker1132 BBU issue/replacement - https://phabricator.wikimedia.org/T331543 (10Cmjohnson) 05Stalled→03Resolved issue turned out to be no issue, resolving the task 
[17:03:16] TheresNoTime: Yeah, the graphs also do not show any weird patterns [17:03:18] marostegui: There is at least one other person who experienced that too. [17:03:35] (03PS1) 10Cwhite: logstash: move mediawiki ecs logs into mediawiki partition [puppet] - 10https://gerrit.wikimedia.org/r/895741 (https://phabricator.wikimedia.org/T234565) [17:03:40] Solar flare [17:03:59] Maybe the frontend is having issues? [17:03:59] (03PS2) 10Cwhite: logstash: mediawiki_ecs copy http_method into place [puppet] - 10https://gerrit.wikimedia.org/r/895739 (https://phabricator.wikimedia.org/T234565) [17:04:00] marostegui: Confirmed. We've (DE team) also seen transient MySQL errors from phab. Not many, but some. [17:04:01] mutante ^ [17:04:50] I can't see anything wrong with the master and the graphs are looking healthy as well [17:05:00] ah good to see some of phabricators' "funny" error messages are still around - now getting `Woe! This request had its journey cut short by unexpected circumstances (Can Not Connect to MySQL).` [17:05:08] hehe [17:05:25] I do like that better than the usual dry messages [17:05:29] some things do seem a bit slow to load. [17:05:37] (and gone, so it's very intermittent, whatever it is..) [17:05:41] (03CR) 10CI reject: [V: 04-1] logstash: move mediawiki ecs logs into mediawiki partition [puppet] - 10https://gerrit.wikimedia.org/r/895741 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [17:06:05] It is indeed very slow [17:06:16] mutante: you around to check the frontend? [17:06:35] (03CR) 10Cwhite: [C: 03+2] logstash: mediawiki_ecs copy http_method into place [puppet] - 10https://gerrit.wikimedia.org/r/895739 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [17:07:00] logs on phab are fairly spammy in general at normal times, but i'm seeing some "AphrontConnectionLostQueryException: #2006: MySQL server has gone away". 
[17:07:06] (03Merged) 10jenkins-bot: Remove the install-crds parameter frlom spark-operator [deployment-charts] - 10https://gerrit.wikimedia.org/r/896137 (https://phabricator.wikimedia.org/T315486) (owner: 10Btullis) [17:07:33] (03PS2) 10Cwhite: logstash: move mediawiki ecs logs into mediawiki partition [puppet] - 10https://gerrit.wikimedia.org/r/895741 (https://phabricator.wikimedia.org/T234565) [17:07:45] brennen: That could be cause the connection has been hanging for a while, and when it tries to re-use that one, it is gone [17:08:12] I am checking the proxy too [17:09:42] I have seen some errors on haproxy, I have reloaded it to see if they clear [17:10:54] Yeah, I think it was haproxy [17:11:29] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: RAID BBU for an-worker1078 - https://phabricator.wikimedia.org/T331544 (10Cmjohnson) This server is out of warranty, I am not sure if we have any spares or a battery we can swap from a decom host. I'll update the task with more info after talking with @Jclark... [17:12:03] 10SRE-Access-Requests: Grant Hal deployment rights - https://phabricator.wikimedia.org/T331647 (10Milimetric) [17:12:13] brennen btullis TheresNoTime dancy let me know if you keep seeing them, the haproxy error is now gone [17:12:19] (03CR) 10CI reject: [V: 04-1] Revert "TransformHandler: Load stashed page bundle based on ETag." [core] (wmf/1.40.0-wmf.26) - 10https://gerrit.wikimedia.org/r/896030 (https://phabricator.wikimedia.org/T331629) (owner: 10Subramanya Sastry) [17:12:28] marostegui: ack, thanks. 
[17:12:34] okay :) thanks again [17:13:15] !log Add EBGP peering from cr1-codfw to cloudsw1-b1-codfw (prod links) T327919 [17:13:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:21] T327919: Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 [17:15:13] (03CR) 10Zabe: [C: 03+2] "recheck" [core] (wmf/1.40.0-wmf.26) - 10https://gerrit.wikimedia.org/r/896030 (https://phabricator.wikimedia.org/T331629) (owner: 10Subramanya Sastry) [17:16:21] 10SRE, 10Data-Engineering-Operations, 10Data-Engineering-Planning, 10Mail: Change the analytics-alerts email alias to a mailman distribution list - https://phabricator.wikimedia.org/T315486 (10BTullis) Sorry, these two patches are unrelated to this patch. Added by mistake. [17:17:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P45723 and previous config saved to /var/cache/conftool/dbconfig/20230309-171711-marostegui.json [17:22:58] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:26:36] (03CR) 10AOkoth: [C: 03+1] Sync more clamd.conf settings from 0.103.8 [puppet] - 10https://gerrit.wikimedia.org/r/895815 (https://phabricator.wikimedia.org/T330129) (owner: 10Muehlenhoff) [17:29:12] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:30:10] (03Merged) 10jenkins-bot: Revert "TransformHandler: Load stashed page bundle based on ETag." 
[core] (wmf/1.40.0-wmf.26) - 10https://gerrit.wikimedia.org/r/896030 (https://phabricator.wikimedia.org/T331629) (owner: 10Subramanya Sastry) [17:31:10] (ThanosQueryRangeLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryRangeLatencyHigh [17:32:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P45724 and previous config saved to /var/cache/conftool/dbconfig/20230309-173217-marostegui.json [17:33:07] (03PS1) 10JHathaway: aux-k8s: fix secret location, attempt three [labs/private] - 10https://gerrit.wikimedia.org/r/896141 [17:33:37] (03CR) 10JHathaway: [V: 03+2 C: 03+2] aux-k8s: fix secret location, attempt three [labs/private] - 10https://gerrit.wikimedia.org/r/896141 (owner: 10JHathaway) [17:36:10] (ThanosQueryRangeLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryRangeLatencyHigh [17:36:24] !log cr1-eqiad: set routing-options static route 208.80.154.238/32 next-hop 208.80.154.10 [17:36:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:34] !log cr1-eqiad: set routing-options static route 208.80.154.238/32 next-hop 208.80.154.10: T330670 [17:37:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:38] T330670: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. 
- https://phabricator.wikimedia.org/T330670 [17:37:39] !log cr2-eqiad: set routing-options static route 208.80.154.238/32 next-hop 208.80.154.10: T330670 [17:37:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:05] !log zabe@deploy2002 Started scap: Backport for [[gerrit:896030|Revert "TransformHandler: Load stashed page bundle based on ETag." (T331629)]] [17:38:09] T331629: HTTP 412 Errors when editing Officewiki - https://phabricator.wikimedia.org/T331629 [17:38:52] (03PS11) 10JHathaway: Add jaeger to aux-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/888761 (https://phabricator.wikimedia.org/T320554) [17:38:58] (03CR) 10Phedenskog: [C: 03+1] Replace Cleopatra page with United_States to facilitate synthetic testing of T326829 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893542 (https://phabricator.wikimedia.org/T326829) (owner: 10Nray) [17:39:46] !log zabe@deploy2002 zabe and ssastry: Backport for [[gerrit:896030|Revert "TransformHandler: Load stashed page bundle based on ETag." (T331629)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [17:39:58] subbu: is there a good way to test this patch? [17:40:06] (03CR) 10JHathaway: [C: 03+1] custom_deploy.d: Make k8s 1.23 istio configs the default [deployment-charts] - 10https://gerrit.wikimedia.org/r/896131 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [17:40:31] I can try to do a bunch of edits and verify if they pass or give me a 412. [17:40:44] not a robust test but better than nothing. [17:40:51] ok [17:41:38] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:41:48] is it on debug then? [17:41:51] let me test.
[17:41:58] yes [17:42:36] !log [ns1] set routing-options static route 208.80.153.231/32 next-hop 208.80.154.10: T330670 [17:42:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:41] T330670: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 [17:44:06] i haven't got 412s on mwdebug .. so, go ahead with it. [17:44:32] cool, syncing [17:46:01] 10SRE, 10Traffic, 10Patch-For-Review: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 (10ssingh) `cr2-eqiad` (replicated to `cr1-eqiad` as well): ` /* ns0 */ route 208.80.154.238/32 { next-hop 208.80.154.10; readvertise; no-reso... [17:46:10] (ThanosQueryRangeLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryRangeLatencyHigh [17:46:22] (03CR) 10Herron: [C: 03+1] logstash: move mediawiki ecs logs into mediawiki partition [puppet] - 10https://gerrit.wikimedia.org/r/895741 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [17:47:07] (03CR) 10JHathaway: [C: 03+1] "looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/896110 (https://phabricator.wikimedia.org/T292942) (owner: 10Muehlenhoff) [17:47:17] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [17:47:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T329260)', diff saved to https://phabricator.wikimedia.org/P45725 and previous config saved to /var/cache/conftool/dbconfig/20230309-174723-marostegui.json [17:47:30] T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260 [17:47:48] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[17:50:02] !log zabe@deploy2002 Finished scap: Backport for [[gerrit:896030|Revert "TransformHandler: Load stashed page bundle based on ETag." (T331629)]] (duration: 11m 57s) [17:50:08] T331629: HTTP 412 Errors when editing Officewiki - https://phabricator.wikimedia.org/T331629 [17:50:10] subbu: should be live [17:50:16] ty [17:51:10] (ThanosQueryRangeLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryRangeLatencyHigh [17:52:21] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/894655 (owner: 10Giuseppe Lavagetto) [17:53:54] !log cr*-codfw [ns1]: set routing-options static route 208.80.153.231/32 next-hop 208.80.153.77: T330670 [17:53:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:59] T330670: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 [17:54:44] (ThanosQueryInstantLatencyHigh) firing: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [17:56:10] (ThanosRuleHighRuleEvaluationFailures) firing: Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures [17:56:50] 10SRE, 10Traffic, 10Patch-For-Review: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. 
- https://phabricator.wikimedia.org/T330670 (10ssingh) `cr2-codfw` (replicated to `cr1-codfw` as well): ` /* ns1 */ route 208.80.153.231/32 { next-hop 208.80.153.77; readvertise; no-reso... [17:59:44] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [18:00:04] bd808: That opportune time is upon us again. Time for a Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230309T1800). [18:00:04] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230309T1800) [18:00:05] !log cr*-codfw [ns0]: set routing-options static route 208.80.154.238/32 next-hop 208.80.153.77: T330670 [18:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:11] T330670: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 [18:00:32] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [18:01:10] (ThanosRuleHighRuleEvaluationFailures) resolved: Thanos Rule is failing to evaluate rules. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures [18:01:16] ^^ the dns may be me sorry [18:01:27] phew ok :) [18:01:30] thanks [18:01:49] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [18:02:07] (03CR) 10BryanDavis: [C: 03+2] developer-portal: Bump container version to 2023-03-06-121941-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/896135 (owner: 10BryanDavis) [18:03:33] Hi, is something happening with local-image-codfw? [18:03:41] *local-swift-codfw [18:03:55] I’m unable to delete one image from Serbian Wikipedia. [18:04:22] Oh, I was able to delete it now after a few tries. [18:07:16] (03PS44) 10David Caro: Modify maintain-dbusers.py to call the rest-api service [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [18:07:21] (03Merged) 10jenkins-bot: developer-portal: Bump container version to 2023-03-06-121941-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/896135 (owner: 10BryanDavis) [18:08:04] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [18:08:25] (03PS1) 10Jbond: apereo_cas: update idp logout script [puppet] - 10https://gerrit.wikimedia.org/r/896146 [18:08:36] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [18:09:03] (03CR) 10CI reject: [V: 04-1] apereo_cas: update idp logout script [puppet] - 10https://gerrit.wikimedia.org/r/896146 (owner: 10Jbond) [18:09:10] !log bd808@deploy2002 helmfile [staging] START helmfile.d/services/developer-portal: apply [18:09:24] (03PS8) 10David Caro: maintain-dbusers: add nicer logging with dry run prefix [puppet] - 10https://gerrit.wikimedia.org/r/895756 (https://phabricator.wikimedia.org/T303663) [18:09:33] !log bd808@deploy2002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [18:09:41] !log 
bd808@deploy2002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [18:10:09] (03PS8) 10Jbond: P:rsyslog: manage /etc/logrotate.d/rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/894646 [18:10:15] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [18:10:16] !log bd808@deploy2002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [18:10:45] !log bd808@deploy2002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [18:10:46] (03CR) 10SBassett: "I confirm that is my new public key for wikimedia production. Let me know if you'd like any additional verification!" [puppet] - 10https://gerrit.wikimedia.org/r/896024 (https://phabricator.wikimedia.org/T331554) (owner: 10MVernon) [18:11:20] !log bd808@deploy2002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [18:11:51] (03PS9) 10Jbond: P:rsyslog: manage /etc/logrotate.d/rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/894646 [18:12:12] (03PS45) 10David Caro: Modify maintain-dbusers.py to call the rest-api service [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [18:13:02] (03CR) 10Jbond: "update" [puppet] - 10https://gerrit.wikimedia.org/r/894646 (owner: 10Jbond) [18:14:13] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review, 10SecTeam-Processed, 10Security: New production ssh key for sbassett - https://phabricator.wikimedia.org/T331554 (10sbassett) >>! In T331554#8678974, @MatthewVernon wrote: > @sbassett I've opened a CR to update your ssh key - if you can confirm it's corre... 
[18:14:38] (03PS12) 10JHathaway: Add jaeger to aux-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/888761 (https://phabricator.wikimedia.org/T320554) [18:15:19] (03PS46) 10David Caro: Modify maintain-dbusers.py to call the rest-api service [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [18:15:38] !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts authdns[1001,2001].wikimedia.org [18:16:40] (03PS9) 10David Caro: maintain-dbusers: add nicer logging with dry run prefix [puppet] - 10https://gerrit.wikimedia.org/r/895756 (https://phabricator.wikimedia.org/T303663) [18:17:24] (03PS2) 10SBassett: eswikiversity: Enable SFS in enforce mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896085 (https://phabricator.wikimedia.org/T331182) (owner: 10MarcoAurelio) [18:18:09] (03CR) 10Slyngshede: [C: 03+1] "Forgot the +1 code-review." [puppet] - 10https://gerrit.wikimedia.org/r/895811 (owner: 10Jbond) [18:18:39] (03CR) 10SBassett: [C: 03+1] "Happy to do a +2 and then config deploy as long as Reedy or anybody else do not have any objections. I don't personally think this needs " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896085 (https://phabricator.wikimedia.org/T331182) (owner: 10MarcoAurelio) [18:18:52] 10SRE-swift-storage, 10Commons, 10Wikimedia-production-error: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10Yann) `00152: FAILED: internal_api_error_UploadChunkFileException: [dc0355d4-60e7-4764-8c67-8ac4166bed53... 
[18:19:41] (03PS1) 10Ssingh: hiera: remove decommissionned authdns[12]001 [puppet] - 10https://gerrit.wikimedia.org/r/896151 (https://phabricator.wikimedia.org/T330670) [18:20:44] PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:21:06] ^ expected [18:21:09] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [18:21:34] (03CR) 10Ssingh: [C: 03+2] hiera: remove decommissionned authdns[12]001 [puppet] - 10https://gerrit.wikimedia.org/r/896151 (https://phabricator.wikimedia.org/T330670) (owner: 10Ssingh) [18:21:40] PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:21:50] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:22:03] ^ expected, will resolve soon after homer [18:22:12] PROBLEM - BFD status on cr2-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:22:17] (03CR) 10BBlack: [C: 03+1] sites.yaml: remove authdns[12]001 [homer/public] - 10https://gerrit.wikimedia.org/r/894102 (https://phabricator.wikimedia.org/T330670) (owner: 10Ssingh) [18:22:28] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:22:29] 10SRE, 10Wikimedia-Mailing-lists: Not receiving posts or moderation messages - https://phabricator.wikimedia.org/T331633 (10Legoktm) [18:22:35] !log sukhe@cumin2002 START - Cookbook sre.dns.netbox [18:22:36] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:22:46] !log running puppet-agent on A:dns-auth to remove deprecated authdns[12]001 [18:22:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:13] 
10SRE, 10Gerrit, 10Infrastructure-Foundations, 10Mail, 10Wikimedia-Mailing-lists: reviewer-bot is not working - https://phabricator.wikimedia.org/T331626 (10Legoktm) [18:23:49] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Wikimedia-Mailing-lists: Mailman hasn't delivered emails since 2023-03-07 14 UTC (was: reviewer-bot is not working) - https://phabricator.wikimedia.org/T331626 (10Legoktm) 05Resolved→03Open p:05Triage→03Medium a:05hashar→03Marostegui [18:24:26] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: authdns[1001,2001].wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002" [18:25:56] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: authdns[1001,2001].wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002" [18:25:56] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:25:57] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts authdns[1001,2001].wikimedia.org [18:26:05] 10SRE, 10Traffic, 10Patch-For-Review: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `authdns[1001,2001].wikimedia.org` - authdns1001.wikimedia.o... 
[18:26:46] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [18:26:47] (03CR) 10JHathaway: [C: 03+2] Add jaeger to aux-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/888761 (https://phabricator.wikimedia.org/T320554) (owner: 10JHathaway) [18:28:34] (03CR) 10Ssingh: [C: 03+2] sites.yaml: remove authdns[12]001 [homer/public] - 10https://gerrit.wikimedia.org/r/894102 (https://phabricator.wikimedia.org/T330670) (owner: 10Ssingh) [18:28:52] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Wikimedia-Mailing-lists, 10Wikimedia-Incident: Mailman hasn't delivered emails since 2023-03-07 14 UTC (was: reviewer-bot is not working) - https://phabricator.wikimedia.org/T331626 (10Legoktm) Re-opening just for tracking while we wait for the queue to go d... [18:31:06] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Wikimedia-Mailing-lists, 10Wikimedia-Incident: Mailman hasn't delivered emails since 2023-03-07 14 UTC (was: reviewer-bot is not working) - https://phabricator.wikimedia.org/T331626 (10Legoktm) There are 2,936 emails in the out queue, it takes ~5.1 seconds t... [18:31:31] !log homer "cr*-eqiad*" commit "Remove authdns1001 from homer, T330670" [18:31:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:36] T330670: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. 
- https://phabricator.wikimedia.org/T330670 [18:32:57] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10Cmjohnson) @MatthewVernon working on these now, I will let you know if I run into any blocks [18:33:34] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:34:02] (03CR) 10Ssingh: [C: 03+2] P:cumin: update alias for dns-auth to reflect changes to dns roles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/894688 (https://phabricator.wikimedia.org/T330670) (owner: 10Ssingh) [18:34:30] (03PS3) 10Ssingh: P:cumin: update alias for dns-auth to reflect changes to dns roles [puppet] - 10https://gerrit.wikimedia.org/r/894688 (https://phabricator.wikimedia.org/T330670) [18:34:45] !log homer "cr*-codfw*" commit "Remove authdns1001 from homer, T330670" [18:34:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:50] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: RAID BBU for an-worker1078 - https://phabricator.wikimedia.org/T331544 (10Jclark-ctr) @Cmjohnson we have a few batteries @BTullis if you can shut down server we can take care of it [18:34:58] !log [correction] homer "cr*-codfw*" commit "Remove authdns2001 from homer, T330670" [18:35:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:30] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1004 is OK: OK - maintain-dbusers is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:38:36] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [18:38:43] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[18:42:33] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Wikimedia-Mailing-lists, 10Wikimedia-Incident: Mailman hasn't delivered emails since 2023-03-07 14 UTC (was: reviewer-bot is not working) - https://phabricator.wikimedia.org/T331626 (10Legoktm) Sent [[ https://lists.wikimedia.org/hyperkitty/list/listadmins@l... [18:43:48] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [18:44:05] !log disable puppet on A:dns-rec to merge CR 895894 [18:44:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:31] (03CR) 10Ssingh: [V: 03+1 C: 03+2] dns::auth: deprecate role and update site.pp [puppet] - 10https://gerrit.wikimedia.org/r/895894 (https://phabricator.wikimedia.org/T330670) (owner: 10Ssingh) [18:45:38] (03PS2) 10Ssingh: dns::auth: deprecate role and update site.pp [puppet] - 10https://gerrit.wikimedia.org/r/895894 (https://phabricator.wikimedia.org/T330670) [18:47:01] !log enable puppet on dns4003 to merge 895894 [18:47:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:46] !log mforns@deploy2002 Started deploy [airflow-dags/analytics@3419b7d]: (no justification provided) [18:50:57] !log mforns@deploy2002 Finished deploy [airflow-dags/analytics@3419b7d]: (no justification provided) (duration: 00m 10s) [18:53:52] !log enable puppet on A:dns-rec and force puppet run: T330670 [18:53:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:57] T330670: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 [19:00:05] jeena and jnuche: That opportune time is upon us again. Time for a MediaWiki train - Utc-7+Utc-0 Version deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230309T1900). 
[19:00:30] RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 18 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:02:43] (03PS1) 10TrainBranchBot: all wikis to 1.40.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896159 (https://phabricator.wikimedia.org/T330204) [19:02:45] (03CR) 10TrainBranchBot: [C: 03+2] all wikis to 1.40.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896159 (https://phabricator.wikimedia.org/T330204) (owner: 10TrainBranchBot) [19:03:29] (03Merged) 10jenkins-bot: all wikis to 1.40.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896159 (https://phabricator.wikimedia.org/T330204) (owner: 10TrainBranchBot) [19:04:29] 10SRE-swift-storage, 10Commons, 10Wikimedia-production-error: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10Yann) `00028: FAILED: internal_api_error_UploadChunkFileException: [f6b5ef11-ddeb-4e07-ba0a-0207b4d5f33c... [19:06:23] !log sukhe@cumin2002 START - Cookbook sre.dns.netbox [19:07:55] I have changes in the netbox cookbook for [19:07:57] +ms-fe1013 1H IN A 10.64.48.149 [19:08:02] +ms-fe1013 1H IN AAAA 2620:0:861:107:10:64:48:149 [19:08:06] is it fine to merge those? [19:09:04] cmjohnson1: ^ last you worked on these? 
sorry if not [19:10:53] !log jhuneidi@deploy2002 rebuilt and synchronized wikiversions files: all wikis to 1.40.0-wmf.26 refs T330204 [19:10:57] T330204: 1.40.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T330204 [19:12:48] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add dns1003 (renamed from authdns1001) - sukhe@cumin2002" [19:14:01] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add dns1003 (renamed from authdns1001) - sukhe@cumin2002" [19:14:01] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:14:52] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 108, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:15:14] !log sukhe@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host dns1003 [19:15:40] !log sukhe@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dns1003 [19:15:46] RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 21 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:17:46] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 179, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:17:53] sukhe, I was working on them [19:18:00] do I need to start over? [19:18:12] cmjohnson1: sorry, no, merged [19:18:13] all good [19:18:18] no changes pending [19:18:24] okay, thanks! 
[19:18:28] RECOVERY - BFD status on cr2-codfw is OK: OK: UP: 19 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:20:08] (03PS1) 10Cathal Mooney: Homer changes as part of WMCS codfw migration to cloudsw1-b1-eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/896160 (https://phabricator.wikimedia.org/T327919) [19:20:47] (03PS3) 10Winston Sung: Add DEPRECATED_LANGUAGE_CODE_MAPPING to wgInterlanguageLinkCodeMap [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558052 (https://phabricator.wikimedia.org/T248352) (owner: 10Fomafix) [19:21:28] (03PS2) 10Jbond: apereo_cas: update idp logout script [puppet] - 10https://gerrit.wikimedia.org/r/896146 [19:22:22] (03CR) 10Cathal Mooney: [C: 03+2] Homer changes as part of WMCS codfw migration to cloudsw1-b1-eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/896160 (https://phabricator.wikimedia.org/T327919) (owner: 10Cathal Mooney) [19:23:02] (03Merged) 10jenkins-bot: Homer changes as part of WMCS codfw migration to cloudsw1-b1-eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/896160 (https://phabricator.wikimedia.org/T327919) (owner: 10Cathal Mooney) [19:35:27] (03CR) 10Bking: [C: 03+2] elastic: Incr per-node shard recovery thru-put cap [puppet] - 10https://gerrit.wikimedia.org/r/895874 (https://phabricator.wikimedia.org/T317816) (owner: 10Ryan Kemper) [19:39:31] (03CR) 10Nray: "FYI, I'm planning to backport this in 1 hour" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893542 (https://phabricator.wikimedia.org/T326829) (owner: 10Nray) [19:39:48] 10SRE, 10DNS, 10Traffic-Icebox: Consider DNSSec - https://phabricator.wikimedia.org/T26413 (10BCornwall) 05Stalled→03Open Setting to open since no work has begun to warrant a "stalled" status. 
[19:41:27] 10SRE, 10Acme-chief, 10Traffic-Icebox: Decide/document criteria needed to serve acme-chief LE issued unified certificate to end users - https://phabricator.wikimedia.org/T230687 (10BCornwall) @Vgutierrez It looks like the work you've done means that this can be closed. Is that the case? [19:46:35] 10SRE, 10Traffic-Icebox, 10Patch-For-Review, 10User-jbond: interface-rps.py should have a flag to avoid CPU0 - https://phabricator.wikimedia.org/T236208 (10BCornwall) 05Stalled→03Resolved Seeing as @RLazarus has kindly merged in the functionality as dictated by this ticket, closing as resolved. Any oth... [19:46:45] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 12:00:00 on an-worker1078.eqiad.wmnet with reason: Replacing RAID BBU [19:46:58] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 12:00:00 on an-worker1078.eqiad.wmnet with reason: Replacing RAID BBU [19:47:03] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: RAID BBU for an-worker1078 - https://phabricator.wikimedia.org/T331544 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=d79d8e43-f7d6-4d5b-b758-f7be36ad2914) set by btullis@cumin1001 for 1 day, 12:00:00 on 1 host(s) and their services with... [19:51:01] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: RAID BBU for an-worker1078 - https://phabricator.wikimedia.org/T331544 (10BTullis) Thanks @Cmjohnson and @Jclark-ctr - I've shut down the machine and given it 36 hours of downtime. Please feel free to boot it whenever the battery is replaced, it should rejoin... 
[19:51:23] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster restart to enable incr shard recovery throughput - ryankemper@cumin1001 - T317816 [19:51:29] T317816: Enable 10G networking in cirrus elastic clusters - https://phabricator.wikimedia.org/T317816 [19:51:38] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for David Martin - https://phabricator.wikimedia.org/T331500 (10DMartin-WMF) Thanks so much, @MatthewVernon and all! [19:52:30] 10SRE, 10DNS, 10Traffic-Icebox: redirect non-existing wikimania2020.wikimedia.org to wikimania.wikimedia.org - https://phabricator.wikimedia.org/T240341 (10BCornwall) Since Wikimania since 2019 lives under https://wikimania.wikimedia.org/wiki/:Wikimania, can this be closed or is there some desire to co... [19:56:55] (03PS1) 10Ssingh: hiera: add host override for dns1003 [puppet] - 10https://gerrit.wikimedia.org/r/896169 (https://phabricator.wikimedia.org/T330670) [19:58:08] PROBLEM - Check systemd state on arclamp1001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_apache2-htcacheclean.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:02:50] (03PS1) 10Ssingh: sites.yaml: add dns1003 [homer/public] - 10https://gerrit.wikimedia.org/r/896171 (https://phabricator.wikimedia.org/T330670) [20:03:20] (03CR) 10Ssingh: [C: 03+2] hiera: add host override for dns1003 [puppet] - 10https://gerrit.wikimedia.org/r/896169 (https://phabricator.wikimedia.org/T330670) (owner: 10Ssingh) [20:06:08] !log sukhe@cumin2002 START - Cookbook sre.dns.netbox [20:07:58] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add dns1003 (renamed from authdns1001) - sukhe@cumin2002" [20:09:01] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add dns1003 (renamed from authdns1001) - sukhe@cumin2002" [20:09:02] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:12:19] !log sukhe@cumin2002 START - Cookbook sre.dns.wipe-cache dns1003.wikimedia.org on all recursors [20:12:22] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) dns1003.wikimedia.org on all recursors [20:12:30] (03PS1) 10Cathal Mooney: Add uRPF checks for new cloudsw interfaces [homer/public] - 10https://gerrit.wikimedia.org/r/896172 (https://phabricator.wikimedia.org/T327919) [20:12:56] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host dns1003.wikimedia.org with OS bullseye [20:13:04] 10SRE, 10Traffic, 10Patch-For-Review: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns1003.wikimedia.org with OS bullseye [20:13:34] (03CR) 10Cathal Mooney: [C: 03+2] Add uRPF checks for new cloudsw interfaces [homer/public] - 10https://gerrit.wikimedia.org/r/896172 (https://phabricator.wikimedia.org/T327919) (owner: 10Cathal Mooney) [20:14:08] (03Merged) 10jenkins-bot: Add uRPF checks for new cloudsw interfaces [homer/public] - 10https://gerrit.wikimedia.org/r/896172 (https://phabricator.wikimedia.org/T327919) (owner: 10Cathal Mooney) [20:24:37] !log move cloud-hosts1-b-codfw GW from core routers to cloudsw1-b1-codfw T327919 [20:24:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:43] T327919: Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 [20:25:14] 10SRE-swift-storage, 10Commons, 10Wikimedia-production-error: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 
(10Yann) Trying the same file on Wikisource: https://en.wikisource.org/wiki/File:Gide_-_The_Vatican_Swindle... [20:25:20] (03PS1) 10Samtar: InitialiseSettings-labs: Enable Phonos on Beta metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896176 (https://phabricator.wikimedia.org/T331670) [20:25:43] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dns1003.wikimedia.org with OS bullseye [20:25:51] 10SRE, 10Traffic, 10Patch-For-Review: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns1003.wikimedia.org with OS bullseye executed with errors... [20:28:28] PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.297 second response time https://wikitech.wikimedia.org/wiki/Swift [20:30:06] RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.141 second response time https://wikitech.wikimedia.org/wiki/Swift [20:30:58] !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dns1003.wikimedia.org'] [20:38:38] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['dns1003.wikimedia.org'] [20:40:19] (03PS1) 10Alexandros Kosiaris: DNM: showcase fixtures for jaeger [deployment-charts] - 10https://gerrit.wikimedia.org/r/896177 [20:42:34] jouncebot: nowandnext [20:42:34] For the next 0 hour(s) and 17 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230309T1900) [20:42:34] In 0 hour(s) and 17 minute(s): UTC late backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230309T2100) [20:43:11] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): Q1:rerack 
elastic10[53-67] - https://phabricator.wikimedia.org/T322082 (10Jclark-ctr) 05Open→03Resolved [20:44:15] !log sukhe@cumin2002 START - Cookbook sre.dns.netbox [20:46:05] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add dns2003 (renamed from authdns2001) - sukhe@cumin2002" [20:46:20] doing a beta-only config deploy prior to the backport window [20:46:38] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896176 (https://phabricator.wikimedia.org/T331670) (owner: 10Samtar) [20:47:01] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add dns2003 (renamed from authdns2001) - sukhe@cumin2002" [20:47:01] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:47:23] (03Merged) 10jenkins-bot: InitialiseSettings-labs: Enable Phonos on Beta metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896176 (https://phabricator.wikimedia.org/T331670) (owner: 10Samtar) [20:53:22] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host dns1003.wikimedia.org with OS bullseye [20:53:31] 10SRE, 10Traffic, 10Patch-For-Review: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. 
- https://phabricator.wikimedia.org/T330670 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns1003.wikimedia.org with OS bullseye [20:54:44] (03PS1) 10Cathal Mooney: Remove uRPF filter for interface ae1.2118 on codfw CRs [homer/public] - 10https://gerrit.wikimedia.org/r/896179 (https://phabricator.wikimedia.org/T327919) [20:59:36] !log sukhe@cumin2002 START - Cookbook sre.dns.wipe-cache dns2003.wikimedia.org on all recursors [20:59:40] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) dns2003.wikimedia.org on all recursors [21:00:04] brennen and TheresNoTime: (Dis)respected human, time to deploy UTC late backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230309T2100). Please do the needful. [21:00:05] James_F and nray: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:11] Heya. [21:00:17] o/ I can deploy if needed [21:00:22] Sure. [21:00:23] o/ [21:00:31] Mine are trivial-ish. 
[21:00:55] will start with them :) [21:01:07] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895351 (owner: 10Jforrester) [21:01:55] (03Merged) 10jenkins-bot: Unload RenameUser, now part of core: Part I of II [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895351 (owner: 10Jforrester) [21:02:05] !log samtar@deploy2002 Started scap: Backport for [[gerrit:895351|Unload RenameUser, now part of core: Part I of II]] [21:02:30] !log sukhe@cumin2002 START - Cookbook sre.dns.wipe-cache dns2003.mgmt.codfw.wmnet on all recursors [21:02:33] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) dns2003.mgmt.codfw.wmnet on all recursors [21:03:44] !log samtar@deploy2002 samtar and jforrester: Backport for [[gerrit:895351|Unload RenameUser, now part of core: Part I of II]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [21:04:00] doesn't this week's train still have some extensions referring the Ext\Renameuser classes? [21:04:16] taavi: The core code sets the aliases I thought. [21:04:21] Hmm. [21:04:48] (waiting, though I have just tested the extension unloaded on en.wiki via the mwdebug and nothing fell over so..) [21:04:57] (03CR) 10Cathal Mooney: [C: 03+2] Remove uRPF filter for interface ae1.2118 on codfw CRs [homer/public] - 10https://gerrit.wikimedia.org/r/896179 (https://phabricator.wikimedia.org/T327919) (owner: 10Cathal Mooney) [21:05:49] no, the aliases are in https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/Renameuser/+/refs/heads/master/includes/RenameUserSetup.php [21:05:57] Bleh. [21:06:14] I mean, these are only used on private wikis in practice. [21:06:15] would you like me to rollback? [21:06:20] Maybe. [21:06:35] But the problem is we have to land the i18n one before the train next week. 
[21:06:48] (03Merged) 10jenkins-bot: Remove uRPF filter for interface ae1.2118 on codfw CRs [homer/public] - 10https://gerrit.wikimedia.org/r/896179 (https://phabricator.wikimedia.org/T327919) (owner: 10Cathal Mooney) [21:07:04] So maybe better to back-port class changes if they blow up? [21:07:06] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dns1003.wikimedia.org with OS bullseye [21:07:14] 10SRE, 10Traffic, 10Patch-For-Review: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns1003.wikimedia.org with OS bullseye executed with errors... [21:08:13] James_F: your call [21:08:18] !log sukhe@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host dns1003 [21:08:29] TheresNoTime: Let's proceed. I'll fix things if they break. [21:08:34] ack :) [21:09:36] !log sukhe@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dns1003 [21:09:41] !log sukhe@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host dns2003 [21:09:45] I fear that things will break silently as features might be gated behind isLoaded() calls [21:10:05] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host dns1003.wikimedia.org with OS bullseye [21:10:14] 10SRE, 10Traffic, 10Patch-For-Review: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. 
- https://phabricator.wikimedia.org/T330670 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns1003.wikimedia.org with OS bullseye [21:10:50] !log sukhe@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dns2003 [21:14:24] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:895351|Unload RenameUser, now part of core: Part I of II]] (duration: 12m 19s) [21:14:58] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:15:22] James_F: okay to move on to the next patch? [21:15:26] TheresNoTime: Yes [21:15:40] (and fwiw, https://codesearch.wmcloud.org/search/?q=MediaWiki%5C%5CExtension%5C%5CRenameuser&i=nope&files=&excludeFiles=&repos= doesn't *seem* to suggest much is using that namespace..?) [21:16:04] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895352 (owner: 10Jforrester) [21:16:48] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [21:16:58] (03Merged) 10jenkins-bot: Unload RenameUser, now part of core: Part II of II [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895352 (owner: 10Jforrester) [21:17:12] !log samtar@deploy2002 Started scap: Backport for [[gerrit:895352|Unload RenameUser, now part of core: Part II of II]] [21:17:35] TheresNoTime: Yeah, a bunch of things have been fixed in the last few days. [21:17:55] taavi: Possibly; I'd have expected it to mostly show up in type errors, which are very noisy in prod. 
[21:18:48] !log samtar@deploy2002 samtar and jforrester: Backport for [[gerrit:895352|Unload RenameUser, now part of core: Part II of II]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [21:18:58] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Adjust and remove reverse DNS records after cloudsw1-b1-codfw migration. - cmooney@cumin1001" [21:19:15] going to continue the sync [21:19:18] (03CR) 10Raymond Ndibe: Modify maintain-dbusers.py to call the rest-api service (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [21:19:27] TheresNoTime: Thanks! [21:19:40] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster restart to enable incr shard recovery throughput - ryankemper@cumin1001 - T317816 [21:19:45] T317816: Enable 10G networking in cirrus elastic clusters - https://phabricator.wikimedia.org/T317816 [21:19:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:20:01] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Adjust and remove reverse DNS records after cloudsw1-b1-codfw migration. 
- cmooney@cumin1001" [21:20:01] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:23:58] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:24:50] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:895352|Unload RenameUser, now part of core: Part II of II]] (duration: 07m 38s) [21:25:52] deployed :) nray, you ready? [21:26:05] yes, thank you! TheresNoTime [21:26:11] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893542 (https://phabricator.wikimedia.org/T326829) (owner: 10Nray) [21:26:57] (03Merged) 10jenkins-bot: Replace Cleopatra page with United_States to facilitate synthetic testing of T326829 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893542 (https://phabricator.wikimedia.org/T326829) (owner: 10Nray) [21:27:09] !log samtar@deploy2002 Started scap: Backport for [[gerrit:893542|Replace Cleopatra page with United_States to facilitate synthetic testing of T326829 (T326829)]] [21:27:14] T326829: Make languages available to index crawlers in mobile version of article pages - https://phabricator.wikimedia.org/T326829 [21:28:45] !log samtar@deploy2002 samtar and nray: Backport for [[gerrit:893542|Replace Cleopatra page with United_States to facilitate synthetic testing of T326829 (T326829)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [21:28:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency 
[21:28:59] nray: that's live on mwdebug, do you need to test it? [21:29:12] TheresNoTime: Yes, I'll take a look. Thank you [21:32:11] TheresNoTime: Looks good! You can proceed [21:32:18] ack :) [21:35:37] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dns1003.wikimedia.org with OS bullseye [21:35:45] 10SRE, 10Traffic, 10Patch-For-Review: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns1003.wikimedia.org with OS bullseye executed with errors... [21:35:49] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host dns1003.wikimedia.org with OS bullseye [21:35:57] 10SRE, 10Traffic, 10Patch-For-Review: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns1003.wikimedia.org with OS bullseye [21:37:53] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:893542|Replace Cleopatra page with United_States to facilitate synthetic testing of T326829 (T326829)]] (duration: 10m 43s) [21:37:58] T326829: Make languages available to index crawlers in mobile version of article pages - https://phabricator.wikimedia.org/T326829 [21:38:01] nray: deployed :) [21:38:09] TheresNoTime: Thank you for your help! 
[21:38:55] !log close UTC late backport [21:38:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:17] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T328832 (10Papaul) 05Open→03Resolved All those nodes are back up now in codfw we can resolve this task [21:49:43] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns1003.wikimedia.org with reason: host reimage [21:51:31] 10SRE, 10DNS, 10Traffic-Icebox: redirect non-existing wikimania2020.wikimedia.org to wikimania.wikimedia.org - https://phabricator.wikimedia.org/T240341 (10Dzahn) Hard to tell because every year the organizers of Wikimania are different people. But from experience this does tend to come back every year and m... [21:52:15] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2002.codfw.wmnet with OS bullseye [21:52:21] 10SRE, 10ops-codfw: codfw:test new Supermicro server - https://phabricator.wikimedia.org/T322578 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host sretest2002.codfw.wmnet with OS bullseye [21:52:28] (03PS1) 10Ssingh: hiera: add host override for dns2003 [puppet] - 10https://gerrit.wikimedia.org/r/896181 (https://phabricator.wikimedia.org/T330670) [21:53:06] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns1003.wikimedia.org with reason: host reimage [21:54:41] (03CR) 10Ssingh: [C: 03+2] hiera: add host override for dns2003 [puppet] - 10https://gerrit.wikimedia.org/r/896181 (https://phabricator.wikimedia.org/T330670) (owner: 10Ssingh) [21:56:22] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host dns2003.wikimedia.org with OS bullseye [21:56:31] 10SRE, 10Traffic, 10Patch-For-Review: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. 
- https://phabricator.wikimedia.org/T330670 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns2003.wikimedia.org with OS bullseye [22:01:37] (03PS1) 10Ssingh: sites.yaml: add dns2003 [homer/public] - 10https://gerrit.wikimedia.org/r/896183 (https://phabricator.wikimedia.org/T330670) [22:02:51] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dns2003.wikimedia.org with OS bullseye [22:03:00] 10SRE, 10Traffic, 10Patch-For-Review: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns2003.wikimedia.org with OS bullseye executed with errors... [22:03:05] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host dns2003.wikimedia.org with OS bullseye [22:03:14] 10SRE, 10Traffic, 10Patch-For-Review: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns2003.wikimedia.org with OS bullseye [22:14:21] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002" [22:14:23] (03PS1) 10Ssingh: hiera: add dns[12]003 to ntp_peers [puppet] - 10https://gerrit.wikimedia.org/r/896185 (https://phabricator.wikimedia.org/T330670) [22:16:15] 10SRE, 10DNS, 10Traffic-Icebox: redirect non-existing wikimania2020.wikimedia.org to wikimania.wikimedia.org - https://phabricator.wikimedia.org/T240341 (10BCornwall) It's a bit bizarre to want them since wikimania.wikimedia.org should default to the latest upcoming conference, wouldn't it? 
[22:16:31] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns2003.wikimedia.org with reason: host reimage [22:18:59] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [22:19:50] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns2003.wikimedia.org with reason: host reimage [22:20:00] 10SRE, 10DNS, 10Traffic-Icebox: redirect non-existing wikimania2020.wikimedia.org to wikimania.wikimedia.org - https://phabricator.wikimedia.org/T240341 (10Dzahn) In the past each Wikimania had its own wiki. I think that's where that comes from. They used to be individual wikis. And each Wikimania has a tota... [22:20:26] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002" [22:20:27] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns1003.wikimedia.org with OS bullseye [22:20:34] 10SRE, 10Traffic, 10Patch-For-Review: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns1003.wikimedia.org with OS bullseye completed: - dns1003... [22:24:17] 10SRE, 10DNS, 10Traffic-Icebox: redirect non-existing wikimania2020.wikimedia.org to wikimania.wikimedia.org - https://phabricator.wikimedia.org/T240341 (10BCornwall) But now that there's a single wiki, isn't the idea of having domains with the year on them moot? 
[22:24:45] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS records for new links to cloudsw1-b1-codfw - cmooney@cumin1001" [22:25:49] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS records for new links to cloudsw1-b1-codfw - cmooney@cumin1001" [22:25:49] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:30:14] (03PS1) 10BryanDavis: Revert "striker: Bump container version to 2023-03-09-005633-production" [puppet] - 10https://gerrit.wikimedia.org/r/896031 (https://phabricator.wikimedia.org/T331674) [22:33:32] (03CR) 10BryanDavis: [C: 03+1] "PCC output: https://puppet-compiler.wmflabs.org/output/896031/40067/" [puppet] - 10https://gerrit.wikimedia.org/r/896031 (https://phabricator.wikimedia.org/T331674) (owner: 10BryanDavis) [22:34:54] 10SRE, 10DNS, 10Traffic-Icebox: redirect non-existing wikimania2020.wikimedia.org to wikimania.wikimedia.org - https://phabricator.wikimedia.org/T240341 (10Aklapper) Please see the child task T202684#5735025. This task has the status `stalled` as it's blocked on T202684. No need to fragment more discussions... 
[22:37:42] (03CR) 10Legoktm: [C: 03+2] Revert "striker: Bump container version to 2023-03-09-005633-production" [puppet] - 10https://gerrit.wikimedia.org/r/896031 (https://phabricator.wikimedia.org/T331674) (owner: 10BryanDavis) [22:40:58] !log Forced puppet run on cloudweb100[34] to apply quick fix for T331674 [22:41:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:41:03] T331674: Some tool maintainers not showing in Striker UI - https://phabricator.wikimedia.org/T331674 [22:41:30] (03CR) 10Ssingh: [C: 03+2] hiera: add dns[12]003 to ntp_peers [puppet] - 10https://gerrit.wikimedia.org/r/896185 (https://phabricator.wikimedia.org/T330670) (owner: 10Ssingh) [22:41:48] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002" [22:43:44] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002" [22:43:45] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns2003.wikimedia.org with OS bullseye [22:43:56] 10SRE, 10Traffic, 10Patch-For-Review: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns2003.wikimedia.org with OS bullseye completed: - dns2003... 
[22:46:39] (03PS1) 10JHathaway: aux: explicitly disable istio injection on namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/896188 (https://phabricator.wikimedia.org/T325178) [22:46:52] (03Abandoned) 10Ssingh: sites.yaml: add dns2003 [homer/public] - 10https://gerrit.wikimedia.org/r/896183 (https://phabricator.wikimedia.org/T330670) (owner: 10Ssingh) [22:47:00] (03Abandoned) 10Ssingh: sites.yaml: add dns1003 [homer/public] - 10https://gerrit.wikimedia.org/r/896171 (https://phabricator.wikimedia.org/T330670) (owner: 10Ssingh) [22:47:14] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2002.codfw.wmnet with OS bullseye [22:47:18] 10SRE, 10ops-codfw: codfw:test new Supermicro server - https://phabricator.wikimedia.org/T322578 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host sretest2002.codfw.wmnet with OS bullseye executed with errors: - sretest2002 (**FAIL**) - Removed from Puppet and P... [22:48:50] 10SRE, 10DNS, 10Traffic-Icebox: redirect non-existing wikimania2020.wikimedia.org to wikimania.wikimedia.org - https://phabricator.wikimedia.org/T240341 (10BCornwall) @Aklapper, Thanks for linking that. I'm still confused as that seems to be another task entirely: That one is about importing **older** wikis... 
[22:49:12] (03PS1) 10Ssingh: sites.yaml: add dns[12]003 to anycast_neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/896190 (https://phabricator.wikimedia.org/T330670) [22:51:57] (03PS1) 10Ssingh: hiera: add dns[12]003 to authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/896191 (https://phabricator.wikimedia.org/T330670) [22:53:03] (03CR) 10Ssingh: [C: 03+2] sites.yaml: add dns[12]003 to anycast_neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/896190 (https://phabricator.wikimedia.org/T330670) (owner: 10Ssingh) [22:53:45] !log run homer in cr*-{codfw,eqiad} for CR 896190 [22:53:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:55:28] (03CR) 10JHathaway: [C: 03+2] aux: explicitly disable istio injection on namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/896188 (https://phabricator.wikimedia.org/T325178) (owner: 10JHathaway) [22:58:59] (03CR) 10Ssingh: [C: 03+2] hiera: add dns[12]003 to authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/896191 (https://phabricator.wikimedia.org/T330670) (owner: 10Ssingh) [23:01:04] !log pool new dns hosts dns1003 and dns2003: T330670 [23:01:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:09] T330670: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 [23:04:42] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. [23:04:44] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. [23:09:09] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [23:09:13] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[23:27:27] (03PS1) 10BryanDavis: striker: Bump container version to 2023-03-09-185548-production [puppet] - 10https://gerrit.wikimedia.org/r/896194 (https://phabricator.wikimedia.org/T330759) [23:32:57] !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@b122672]: import_ttl: replace HdfsSensor with URLSensor [23:33:11] !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@b122672]: import_ttl: replace HdfsSensor with URLSensor (duration: 00m 14s) [23:47:51] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/output/894744/40070/" [puppet] - 10https://gerrit.wikimedia.org/r/894744 (https://phabricator.wikimedia.org/T330091) (owner: 10Dzahn) [23:52:26] !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@7b25fbf]: import_ttl: correct date formatting [23:52:40] !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@7b25fbf]: import_ttl: correct date formatting (duration: 00m 14s) [23:57:52] (03CR) 10Cwhite: [C: 03+1] "SGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/895878 (https://phabricator.wikimedia.org/T284555) (owner: 10BCornwall)