[00:03:52] <icinga-wm>	 PROBLEM - Check systemd state on krb1001 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:06:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T329260)', diff saved to https://phabricator.wikimedia.org/P45594 and previous config saved to /var/cache/conftool/dbconfig/20230309-000651-marostegui.json
[00:06:57] <stashbot>	 T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260
[00:10:04] <icinga-wm>	 RECOVERY - Disk space on grafana2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops
[00:11:36] <wikibugs>	 (CR) Dzahn: "Is there a problem if hound is running even though hound_proxy is not running? I don't really know much about it and to test this change o" [puppet] - https://gerrit.wikimedia.org/r/895884 (https://phabricator.wikimedia.org/T284555) (owner: BCornwall)
[00:21:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P45596 and previous config saved to /var/cache/conftool/dbconfig/20230309-002157-marostegui.json
[00:24:51] <wikibugs>	 (CR) Dzahn: peopleweb: ensure each user automatically gets a public_html dir (2 comments) [puppet] - https://gerrit.wikimedia.org/r/894744 (https://phabricator.wikimedia.org/T330091) (owner: Dzahn)
[00:26:08] <wikibugs>	 (PS3) Dzahn: peopleweb: ensure each user automatically gets a public_html dir [puppet] - https://gerrit.wikimedia.org/r/894744 (https://phabricator.wikimedia.org/T330091)
[00:37:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P45597 and previous config saved to /var/cache/conftool/dbconfig/20230309-003703-marostegui.json
[00:52:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T329260)', diff saved to https://phabricator.wikimedia.org/P45598 and previous config saved to /var/cache/conftool/dbconfig/20230309-005210-marostegui.json
[00:52:12] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2167.codfw.wmnet with reason: Maintenance
[00:52:14] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2167.codfw.wmnet with reason: Maintenance
[00:52:16] <stashbot>	 T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260
[00:52:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2167:3318 (T329260)', diff saved to https://phabricator.wikimedia.org/P45599 and previous config saved to /var/cache/conftool/dbconfig/20230309-005220-marostegui.json
[00:59:31] <wikibugs>	 (PS1) BryanDavis: striker: Bump container version to 2023-03-09-005633-production [puppet] - https://gerrit.wikimedia.org/r/895892 (https://phabricator.wikimedia.org/T330421)
[01:02:57] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for David Martin - https://phabricator.wikimedia.org/T331500 (DMartin-WMF) @MatthewVernon - Per my 1:1 discussion with @dr0ptp4kt  earlier today, it would be good for me to have kerberos access.  Apologies for not mentioning that...
[01:03:16] <wikibugs>	 (CR) BryanDavis: [C: +1] "PCC results: https://puppet-compiler.wmflabs.org/output/895892/40040/" [puppet] - https://gerrit.wikimedia.org/r/895892 (https://phabricator.wikimedia.org/T330421) (owner: BryanDavis)
[01:05:45] <wikibugs>	 (CR) BryanDavis: [C: +1] "This should be safe to merge and deploy whenever you have time on Thursday andrewbogott. There is a manual step I will need to do after th" [puppet] - https://gerrit.wikimedia.org/r/895892 (https://phabricator.wikimedia.org/T330421) (owner: BryanDavis)
[01:09:23] <wikibugs>	 (CR) Cwhite: [C: +1] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/895144 (owner: Muehlenhoff)
[01:09:55] <wikibugs>	 (CR) Cwhite: [C: +1] alertmanager: highlight 'source' label [puppet] - https://gerrit.wikimedia.org/r/895713 (owner: Filippo Giunchedi)
[01:12:51] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318 (T329260)', diff saved to https://phabricator.wikimedia.org/P45600 and previous config saved to /var/cache/conftool/dbconfig/20230309-011251-marostegui.json
[01:12:57] <stashbot>	 T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260
[01:13:47] <wikibugs>	 (CR) Cwhite: "SGTM, thanks!" [puppet] - https://gerrit.wikimedia.org/r/895719 (owner: Filippo Giunchedi)
[01:18:25] <logmsgbot>	 !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@558da74]: correct eventgate datacenter partitioning in sensors
[01:18:39] <logmsgbot>	 !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@558da74]: correct eventgate datacenter partitioning in sensors (duration: 00m 13s)
[01:27:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318', diff saved to https://phabricator.wikimedia.org/P45601 and previous config saved to /var/cache/conftool/dbconfig/20230309-012757-marostegui.json
[01:34:17] <wikibugs>	 (CR) Ssingh: [C: +1] fifo-log-demux: systemd Requires= to BindsTo= [puppet] - https://gerrit.wikimedia.org/r/895886 (https://phabricator.wikimedia.org/T284555) (owner: BCornwall)
[01:34:34] <wikibugs>	 (CR) Ssingh: [C: +1] ats-mtail: Change systemd Requires= to BindsTo= [puppet] - https://gerrit.wikimedia.org/r/895878 (https://phabricator.wikimedia.org/T284555) (owner: BCornwall)
[01:36:36] <wikibugs>	 (CR) Ssingh: [C: +1] "Looks good! I think out of an abundance of caution, when we merge this during working hours even, we should disable Puppet on A:cp and the" [puppet] - https://gerrit.wikimedia.org/r/895875 (https://phabricator.wikimedia.org/T284555) (owner: BCornwall)
[01:39:56] <wikibugs>	 (CR) Ssingh: [C: +1] "https://puppet-compiler.wmflabs.org/output/895875/40041/" [puppet] - https://gerrit.wikimedia.org/r/895875 (https://phabricator.wikimedia.org/T284555) (owner: BCornwall)
[01:43:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318', diff saved to https://phabricator.wikimedia.org/P45602 and previous config saved to /var/cache/conftool/dbconfig/20230309-014303-marostegui.json
[01:58:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318 (T329260)', diff saved to https://phabricator.wikimedia.org/P45603 and previous config saved to /var/cache/conftool/dbconfig/20230309-015810-marostegui.json
[01:58:12] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2168.codfw.wmnet with reason: Maintenance
[01:58:16] <stashbot>	 T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260
[01:58:25] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2168.codfw.wmnet with reason: Maintenance
[01:58:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2168:3318 (T329260)', diff saved to https://phabricator.wikimedia.org/P45604 and previous config saved to /var/cache/conftool/dbconfig/20230309-015831-marostegui.json
[02:04:48] <wikibugs>	 (PS1) Ssingh: dns::auth: deprecate role and update site.pp [puppet] - https://gerrit.wikimedia.org/r/895894 (https://phabricator.wikimedia.org/T330670)
[02:05:46] <wikibugs>	 (CR) Ssingh: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40042/console" [puppet] - https://gerrit.wikimedia.org/r/895894 (https://phabricator.wikimedia.org/T330670) (owner: Ssingh)
[02:06:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:07:38] <wikibugs>	 (CR) Ssingh: [V: +1] "DO NOT MERGE until after authdns[12]001 deprecation." [puppet] - https://gerrit.wikimedia.org/r/895894 (https://phabricator.wikimedia.org/T330670) (owner: Ssingh)
[02:13:57] <wikibugs>	 SRE, API Platform, GrowthExperiments-ImpactModule, Growth-Team (Current Sprint), MW-1.40-notes (1.40.0-wmf.21; 2023-01-30): UserImpact: Fetch information for more articles when calculating most-viewed-articles data point - https://phabricator.wikimedia.org/T324675 (Etonkovidova) Open→R...
[02:19:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318 (T329260)', diff saved to https://phabricator.wikimedia.org/P45606 and previous config saved to /var/cache/conftool/dbconfig/20230309-021905-marostegui.json
[02:19:11] <stashbot>	 T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260
[02:26:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:34:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318', diff saved to https://phabricator.wikimedia.org/P45607 and previous config saved to /var/cache/conftool/dbconfig/20230309-023411-marostegui.json
[02:43:33] <jinxer-wm>	 (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on acmechief2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[02:49:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318', diff saved to https://phabricator.wikimedia.org/P45608 and previous config saved to /var/cache/conftool/dbconfig/20230309-024917-marostegui.json
[02:59:44] <sukhe>	 !log run keyholder arm on acmechief2001
[02:59:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:03:33] <jinxer-wm>	 (KeyholderUnarmed) resolved: 1 unarmed Keyholder key(s) on acmechief2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[03:04:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318 (T329260)', diff saved to https://phabricator.wikimedia.org/P45609 and previous config saved to /var/cache/conftool/dbconfig/20230309-030424-marostegui.json
[03:04:26] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2181.codfw.wmnet with reason: Maintenance
[03:04:30] <stashbot>	 T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260
[03:04:39] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2181.codfw.wmnet with reason: Maintenance
[03:04:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2181 (T329260)', diff saved to https://phabricator.wikimedia.org/P45610 and previous config saved to /var/cache/conftool/dbconfig/20230309-030445-marostegui.json
[03:19:09] <wikibugs>	 (PS3) Andrea Denisse: rsyslog: Remove centrallog1001 as TLS rsyslog destination [puppet] - https://gerrit.wikimedia.org/r/890884 (https://phabricator.wikimedia.org/T328803)
[03:20:44] <wikibugs>	 (CR) Andrea Denisse: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40043/console" [puppet] - https://gerrit.wikimedia.org/r/890884 (https://phabricator.wikimedia.org/T328803) (owner: Andrea Denisse)
[03:21:51] <wikibugs>	 (CR) Andrea Denisse: [V: +1] "PCC results: https://puppet-compiler.wmflabs.org/output/890884/40043/" [puppet] - https://gerrit.wikimedia.org/r/890884 (https://phabricator.wikimedia.org/T328803) (owner: Andrea Denisse)
[03:24:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T329260)', diff saved to https://phabricator.wikimedia.org/P45611 and previous config saved to /var/cache/conftool/dbconfig/20230309-032406-marostegui.json
[03:24:12] <stashbot>	 T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260
[03:29:58] <wikibugs>	 (PS1) Andrea Denisse: centrallog: Remove centrallog1002 from the kafka-jumbo allow list [puppet] - https://gerrit.wikimedia.org/r/895898 (https://phabricator.wikimedia.org/T328803)
[03:34:20] <wikibugs>	 (CR) Andrea Denisse: [V: +1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40044/console" [puppet] - https://gerrit.wikimedia.org/r/895898 (https://phabricator.wikimedia.org/T328803) (owner: Andrea Denisse)
[03:35:30] <wikibugs>	 (CR) Andrea Denisse: [V: +1 C: +1] "PCC results: https://puppet-compiler.wmflabs.org/output/895898/40044/" [puppet] - https://gerrit.wikimedia.org/r/895898 (https://phabricator.wikimedia.org/T328803) (owner: Andrea Denisse)
[03:39:13] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P45612 and previous config saved to /var/cache/conftool/dbconfig/20230309-033912-marostegui.json
[03:42:20] <wikibugs>	 (PS1) Andrea Denisse: centrallog: Add centrallog1002 as the kafkatee active host [puppet] - https://gerrit.wikimedia.org/r/895902 (https://phabricator.wikimedia.org/T328803)
[03:43:31] <wikibugs>	 (CR) Andrea Denisse: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40045/console" [puppet] - https://gerrit.wikimedia.org/r/895902 (https://phabricator.wikimedia.org/T328803) (owner: Andrea Denisse)
[03:44:26] <wikibugs>	 (CR) Andrea Denisse: [V: +1] "PCC results: https://puppet-compiler.wmflabs.org/output/895902/40045/" [puppet] - https://gerrit.wikimedia.org/r/895902 (https://phabricator.wikimedia.org/T328803) (owner: Andrea Denisse)
[03:48:44] <icinga-wm>	 PROBLEM - dump of m2 in eqiad on backupmon1001 is CRITICAL: dump for m2 at eqiad (db1117) taken more than a week ago: Most recent backup 2023-02-28 03:17:30 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[03:54:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P45613 and previous config saved to /var/cache/conftool/dbconfig/20230309-035418-marostegui.json
[04:09:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T329260)', diff saved to https://phabricator.wikimedia.org/P45614 and previous config saved to /var/cache/conftool/dbconfig/20230309-040925-marostegui.json
[04:09:31] <stashbot>	 T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260
[04:30:36] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:27:40] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2112.codfw.wmnet with reason: Maintenance
[06:27:53] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2112.codfw.wmnet with reason: Maintenance
[06:30:15] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2140.codfw.wmnet with reason: Maintenance
[06:30:28] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2140.codfw.wmnet with reason: Maintenance
[06:33:20] <icinga-wm>	 PROBLEM - Check systemd state on arclamp2001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_apache2-htcacheclean.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:40:24] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 14 hosts with reason: Schema change
[06:40:34] <marostegui>	 !log Deploy schema change on s6 eqiad dbmaint T329684
[06:40:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:40:46] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 14 hosts with reason: Schema change
[06:40:54] <stashbot>	 T329684: Drop default value from cuc_actor and cuc_comment_id on wmf wikis - https://phabricator.wikimedia.org/T329684
[06:42:37] <marostegui>	 !log Deploy schema change on s5 eqiad dbmaint T329684
[06:42:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:43:20] <marostegui>	 !log Deploy schema change on s2 eqiad dbmaint T329684
[06:43:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:45:19] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2109.codfw.wmnet with reason: Maintenance
[06:45:33] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2109.codfw.wmnet with reason: Maintenance
[06:45:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2109 (T329203)', diff saved to https://phabricator.wikimedia.org/P45615 and previous config saved to /var/cache/conftool/dbconfig/20230309-064538-marostegui.json
[06:45:47] <stashbot>	 T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[06:46:21] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2165.codfw.wmnet with reason: Maintenance
[06:46:34] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2165.codfw.wmnet with reason: Maintenance
[06:48:01] <marostegui>	 !log Deploy schema change on s4 eqiad dbmaint T329684
[06:48:03] <marostegui>	 !log Deploy schema change on s1 eqiad dbmaint T329684
[06:48:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:48:11] <stashbot>	 T329684: Drop default value from cuc_actor and cuc_comment_id on wmf wikis - https://phabricator.wikimedia.org/T329684
[06:48:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:49:21] <wikibugs>	 (PS1) Kosta Harlan: User impact: Work around MariaDB query planner bug [extensions/GrowthExperiments] (wmf/1.40.0-wmf.26) - https://gerrit.wikimedia.org/r/895781 (https://phabricator.wikimedia.org/T331264)
[06:49:47] <wikibugs>	 (PS1) Kosta Harlan: User impact: Work around MariaDB query planner bug [extensions/GrowthExperiments] (wmf/1.40.0-wmf.25) - https://gerrit.wikimedia.org/r/895782 (https://phabricator.wikimedia.org/T331264)
[06:58:46] <wikibugs>	 (PS1) KartikMistry: Update cxserver to 2023-03-09-061555-production [deployment-charts] - https://gerrit.wikimedia.org/r/895904 (https://phabricator.wikimedia.org/T331097)
[07:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230309T0700)
[07:00:04] <jouncebot>	 kormat, marostegui, and Amir1: May I have your attention please! Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230309T0700)
[07:02:15] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[07:02:17] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[07:02:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T329684)', diff saved to https://phabricator.wikimedia.org/P45616 and previous config saved to /var/cache/conftool/dbconfig/20230309-070223-marostegui.json
[07:02:31] <stashbot>	 T329684: Drop default value from cuc_actor and cuc_comment_id on wmf wikis - https://phabricator.wikimedia.org/T329684
[07:03:08] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2117.codfw.wmnet with reason: Maintenance
[07:03:21] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2117.codfw.wmnet with reason: Maintenance
[07:03:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2117 (T329684)', diff saved to https://phabricator.wikimedia.org/P45617 and previous config saved to /var/cache/conftool/dbconfig/20230309-070327-marostegui.json
[07:06:06] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2117.codfw.wmnet with reason: Maintenance
[07:06:08] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2117.codfw.wmnet with reason: Maintenance
[07:06:50] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[07:06:52] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[07:06:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T329684)', diff saved to https://phabricator.wikimedia.org/P45618 and previous config saved to /var/cache/conftool/dbconfig/20230309-070658-marostegui.json
[07:07:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2117 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P45619 and previous config saved to /var/cache/conftool/dbconfig/20230309-070733-root.json
[07:08:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T329684)', diff saved to https://phabricator.wikimedia.org/P45620 and previous config saved to /var/cache/conftool/dbconfig/20230309-070805-marostegui.json
[07:08:15] <stashbot>	 T329684: Drop default value from cuc_actor and cuc_comment_id on wmf wikis - https://phabricator.wikimedia.org/T329684
[07:09:50] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2099.codfw.wmnet with reason: Maintenance
[07:10:04] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2099.codfw.wmnet with reason: Maintenance
[07:10:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1113:3316', diff saved to https://phabricator.wikimedia.org/P45621 and previous config saved to /var/cache/conftool/dbconfig/20230309-071029-root.json
[07:11:44] <wikibugs>	 (PS1) Marostegui: change_cuc_actor_T329684.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/895906 (https://phabricator.wikimedia.org/T329684)
[07:12:39] <wikibugs>	 SRE, serviceops: mw2420-mw2451 service implementation tracking - https://phabricator.wikimedia.org/T326363 (akosiaris)
[07:13:13] <marostegui>	 !log Deploy schema change on s8 eqiad dbmaint T329684
[07:13:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:19] <stashbot>	 T329684: Drop default value from cuc_actor and cuc_comment_id on wmf wikis - https://phabricator.wikimedia.org/T329684
[07:13:56] <marostegui>	 !log Deploy schema change on s7 eqiad dbmaint T329684
[07:14:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:14:19] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 15 hosts with reason: Schema change
[07:14:41] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 15 hosts with reason: Schema change
[07:15:13] <marostegui>	 !log Deploy schema change on s3 eqiad dbmaint T329684
[07:15:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:50] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2106.codfw.wmnet with reason: Maintenance
[07:18:03] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2106.codfw.wmnet with reason: Maintenance
[07:18:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2106 (T329260)', diff saved to https://phabricator.wikimedia.org/P45622 and previous config saved to /var/cache/conftool/dbconfig/20230309-071809-marostegui.json
[07:18:13] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2097.codfw.wmnet with reason: Maintenance
[07:18:18] <stashbot>	 T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260
[07:18:26] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2097.codfw.wmnet with reason: Maintenance
[07:18:34] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance
[07:18:48] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance
[07:18:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2104 (T329684)', diff saved to https://phabricator.wikimedia.org/P45623 and previous config saved to /var/cache/conftool/dbconfig/20230309-071853-marostegui.json
[07:19:00] <stashbot>	 T329684: Drop default value from cuc_actor and cuc_comment_id on wmf wikis - https://phabricator.wikimedia.org/T329684
[07:20:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2104 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P45624 and previous config saved to /var/cache/conftool/dbconfig/20230309-072040-root.json
[07:22:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2117 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P45625 and previous config saved to /var/cache/conftool/dbconfig/20230309-072238-root.json
[07:23:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T329203)', diff saved to https://phabricator.wikimedia.org/P45626 and previous config saved to /var/cache/conftool/dbconfig/20230309-072319-marostegui.json
[07:23:25] <stashbot>	 T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[07:23:50] <wikibugs>	 (PS2) Marostegui: change_cuc_actor_T329684.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/895906 (https://phabricator.wikimedia.org/T329684)
[07:31:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106 (T329260)', diff saved to https://phabricator.wikimedia.org/P45627 and previous config saved to /var/cache/conftool/dbconfig/20230309-073127-marostegui.json
[07:31:38] <stashbot>	 T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260
[07:34:58] <wikibugs>	 (PS1) Marostegui: m5-proxies: Add db1176 for testing [puppet] - https://gerrit.wikimedia.org/r/895908 (https://phabricator.wikimedia.org/T330847)
[07:35:44] <wikibugs>	 (CR) Marostegui: [C: +2] m5-proxies: Add db1176 for testing [puppet] - https://gerrit.wikimedia.org/r/895908 (https://phabricator.wikimedia.org/T330847) (owner: Marostegui)
[07:35:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2104 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P45628 and previous config saved to /var/cache/conftool/dbconfig/20230309-073545-root.json
[07:37:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2117 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P45629 and previous config saved to /var/cache/conftool/dbconfig/20230309-073743-root.json
[07:38:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P45630 and previous config saved to /var/cache/conftool/dbconfig/20230309-073825-marostegui.json
[07:39:17] <wikibugs>	 SRE, DBA, Toolhub, Wikimedia-Mailing-lists, and 2 others: Switchover m5 master (db1183 -> db1176) - https://phabricator.wikimedia.org/T330847 (Marostegui) Checked that haproxy sees db1176 just fine
[07:39:22] <wikibugs>	 (PS1) Marostegui: Revert "m5-proxies: Add db1176 for testing" [puppet] - https://gerrit.wikimedia.org/r/895783
[07:40:00] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "m5-proxies: Add db1176 for testing" [puppet] - https://gerrit.wikimedia.org/r/895783 (owner: Marostegui)
[07:40:20] <apergos>	 folks, I'm not feeling well enough to run the deployment window, I see no patches scheduled at this time. If someone sneaks one in at the last minute, Amir1 or jnuche, I hope one of you will be available. (Also no trainees signed up today either so no worries there.)
[07:41:05] <marostegui>	 Amir.1 is on vacation
[07:44:19] <wikibugs>	 (PS1) Marostegui: mariadb: Promote db1176 to m5 master [puppet] - https://gerrit.wikimedia.org/r/895910 (https://phabricator.wikimedia.org/T330847)
[07:44:31] <wikibugs>	 SRE, serviceops: mw2420-mw2451 service implementation tracking - https://phabricator.wikimedia.org/T326363 (akosiaris)
[07:44:41] <wikibugs>	 (CR) Marostegui: [C: -2] "Wait for the failover time" [puppet] - https://gerrit.wikimedia.org/r/895910 (https://phabricator.wikimedia.org/T330847) (owner: Marostegui)
[07:45:37] <wikibugs>	 SRE, DBA, Toolhub, Wikimedia-Mailing-lists, and 2 others: Switchover m5 master (db1183 -> db1176) - https://phabricator.wikimedia.org/T330847 (Marostegui)
[07:46:02] <wikibugs>	 SRE, DBA, Toolhub, Wikimedia-Mailing-lists, and 2 others: Switchover m5 master (db1183 -> db1176) - https://phabricator.wikimedia.org/T330847 (Marostegui)
[07:46:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106', diff saved to https://phabricator.wikimedia.org/P45631 and previous config saved to /var/cache/conftool/dbconfig/20230309-074633-marostegui.json
[07:47:18] <wikibugs>	 (PS1) Elukey: profile::calico::kubernetes: set new istio-cni defaults [puppet] - https://gerrit.wikimedia.org/r/895911 (https://phabricator.wikimedia.org/T328291)
[07:48:26] <wikibugs>	 SRE, DBA, Toolhub, Wikimedia-Mailing-lists, and 2 others: Switchover m5 master (db1183 -> db1176) - https://phabricator.wikimedia.org/T330847 (Marostegui)
[07:49:03] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40046/console" [puppet] - https://gerrit.wikimedia.org/r/895911 (https://phabricator.wikimedia.org/T328291) (owner: Elukey)
[07:50:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2104 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P45632 and previous config saved to /var/cache/conftool/dbconfig/20230309-075050-root.json
[07:51:29] <wikibugs>	 10SRE, 10serviceops: mw2420-mw2451 service implementation tracking - https://phabricator.wikimedia.org/T326363 (10akosiaris)
[07:52:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2117 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P45633 and previous config saved to /var/cache/conftool/dbconfig/20230309-075247-root.json
[07:53:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P45634 and previous config saved to /var/cache/conftool/dbconfig/20230309-075331-marostegui.json
[07:57:26] <wikibugs>	 10SRE, 10serviceops: mw2420-mw2451 service implementation tracking - https://phabricator.wikimedia.org/T326363 (10akosiaris) p:05Triage→03Medium While I did provide data on specific racks, given our availability zones are centered around rows right now, I am gonna focus on rows. Looking at the data I note...
[08:00:05] <jouncebot>	 Amir1, apergos, and jnuche: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230309T0800).
[08:01:40] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106', diff saved to https://phabricator.wikimedia.org/P45635 and previous config saved to /var/cache/conftool/dbconfig/20230309-080140-marostegui.json
[08:05:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2104 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P45636 and previous config saved to /var/cache/conftool/dbconfig/20230309-080555-root.json
[08:07:13] <kostajh>	 hi, I have a patch to add to the window
[08:07:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2117 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P45637 and previous config saved to /var/cache/conftool/dbconfig/20230309-080752-root.json
[08:08:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T329203)', diff saved to https://phabricator.wikimedia.org/P45638 and previous config saved to /var/cache/conftool/dbconfig/20230309-080837-marostegui.json
[08:08:39] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2127.codfw.wmnet with reason: Maintenance
[08:08:47] <stashbot>	 T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[08:08:49] <taavi>	 I can deploy
[08:08:53] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2127.codfw.wmnet with reason: Maintenance
[08:08:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2127 (T329203)', diff saved to https://phabricator.wikimedia.org/P45639 and previous config saved to /var/cache/conftool/dbconfig/20230309-080858-marostegui.json
[08:09:04] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] logstash: Stop apache2-htcacheclean.service via Puppet [puppet] - 10https://gerrit.wikimedia.org/r/895144 (owner: 10Muehlenhoff)
[08:09:58] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.26) - 10https://gerrit.wikimedia.org/r/895781 (https://phabricator.wikimedia.org/T331264) (owner: 10Kosta Harlan)
[08:10:04] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.25) - 10https://gerrit.wikimedia.org/r/895782 (https://phabricator.wikimedia.org/T331264) (owner: 10Kosta Harlan)
[08:10:11] <wikibugs>	 (03PS4) 10Giuseppe Lavagetto: Add check_dns_state to service.Service [software/spicerack] - 10https://gerrit.wikimedia.org/r/894655
[08:10:17] <kostajh>	 thanks taavi 
[08:10:28] <wikibugs>	 (03CR) 10DCausse: [C: 03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/895874 (https://phabricator.wikimedia.org/T317816) (owner: 10Ryan Kemper)
[08:13:12] <wikibugs>	 (03PS1) 10Muehlenhoff: Fix service name [puppet] - 10https://gerrit.wikimedia.org/r/896006
[08:13:50] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add check_dns_state to service.Service [software/spicerack] - 10https://gerrit.wikimedia.org/r/894655 (owner: 10Giuseppe Lavagetto)
[08:16:20] <icinga-wm>	 PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS1299/IPv6: Connect - Telia, AS1299/IPv4: Connect - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:16:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106 (T329260)', diff saved to https://phabricator.wikimedia.org/P45640 and previous config saved to /var/cache/conftool/dbconfig/20230309-081646-marostegui.json
[08:16:48] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2110.codfw.wmnet with reason: Maintenance
[08:16:53] <stashbot>	 T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260
[08:17:01] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2110.codfw.wmnet with reason: Maintenance
[08:17:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2110 (T329260)', diff saved to https://phabricator.wikimedia.org/P45641 and previous config saved to /var/cache/conftool/dbconfig/20230309-081707-marostegui.json
[08:17:38] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Fix service name [puppet] - 10https://gerrit.wikimedia.org/r/896006 (owner: 10Muehlenhoff)
[08:18:12] <icinga-wm>	 RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 91, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:21:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2104 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P45642 and previous config saved to /var/cache/conftool/dbconfig/20230309-082059-root.json
[08:22:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2117 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P45643 and previous config saved to /var/cache/conftool/dbconfig/20230309-082257-root.json
[08:23:54] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on ganeti1011.eqiad.wmnet with reason: remove from cluster for reimage
[08:23:59] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on ganeti1011.eqiad.wmnet with reason: remove from cluster for reimage
[08:24:03] <wikibugs>	 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=488c31ea-afbd-425c-93db-bb4f4daa8146) set by jmm@cumin2002 for 2 days, 0:00:00 on 1 host(s) and their services with r...
[08:27:15] <wikibugs>	 (03PS5) 10Giuseppe Lavagetto: Add check_dns_state to service.Service [software/spicerack] - 10https://gerrit.wikimedia.org/r/894655
[08:27:44] <wikibugs>	 (03Merged) 10jenkins-bot: User impact: Work around MariaDB query planner bug [extensions/GrowthExperiments] (wmf/1.40.0-wmf.26) - 10https://gerrit.wikimedia.org/r/895781 (https://phabricator.wikimedia.org/T331264) (owner: 10Kosta Harlan)
[08:27:48] <wikibugs>	 (03Merged) 10jenkins-bot: User impact: Work around MariaDB query planner bug [extensions/GrowthExperiments] (wmf/1.40.0-wmf.25) - 10https://gerrit.wikimedia.org/r/895782 (https://phabricator.wikimedia.org/T331264) (owner: 10Kosta Harlan)
[08:27:50] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: Add check_dns_state to service.Service (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/894655 (owner: 10Giuseppe Lavagetto)
[08:28:15] <logmsgbot>	 !log taavi@deploy2002 Started scap: Backport for [[gerrit:895781|User impact: Work around MariaDB query planner bug (T331264)]], [[gerrit:895782|User impact: Work around MariaDB query planner bug (T331264)]]
[08:28:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110 (T329260)', diff saved to https://phabricator.wikimedia.org/P45644 and previous config saved to /var/cache/conftool/dbconfig/20230309-082820-marostegui.json
[08:28:22] <stashbot>	 T331264: Error 2006 from GrowthExperiments\UserImpact\ComputedUserImpactLookup::getEditData - https://phabricator.wikimedia.org/T331264
[08:28:27] <stashbot>	 T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260
[08:30:11] <wikibugs>	 (03CR) 10Nicolas Fraison: Specify docker image and version consistently (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/895842 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis)
[08:30:39] <logmsgbot>	 !log taavi@deploy2002 taavi and kharlan: Backport for [[gerrit:895781|User impact: Work around MariaDB query planner bug (T331264)]], [[gerrit:895782|User impact: Work around MariaDB query planner bug (T331264)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet
[08:31:13] <taavi>	 kostajh: please test
[08:31:28] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add check_dns_state to service.Service [software/spicerack] - 10https://gerrit.wikimedia.org/r/894655 (owner: 10Giuseppe Lavagetto)
[08:31:41] <kostajh>	 taavi: ack. both wmf.25 and wmf.26?
[08:31:54] <taavi>	 yes
[08:33:17] <moritzm>	 !log remove ganeti1011 for eventual reimage T311687
[08:33:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:21] <stashbot>	 T311687: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687
[08:33:43] <kostajh>	 taavi: lgtm!
[08:33:59] <taavi>	 thanks, syncing
[08:34:06] <marostegui>	 kostajh: I am going to monitor the errors for a bit and see if they go away :)
[08:35:22] <kostajh>	 thanks, both
[08:36:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2104 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P45645 and previous config saved to /var/cache/conftool/dbconfig/20230309-083604-root.json
[08:38:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2117 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P45646 and previous config saved to /var/cache/conftool/dbconfig/20230309-083802-root.json
[08:39:52] <logmsgbot>	 !log taavi@deploy2002 Finished scap: Backport for [[gerrit:895781|User impact: Work around MariaDB query planner bug (T331264)]], [[gerrit:895782|User impact: Work around MariaDB query planner bug (T331264)]] (duration: 11m 37s)
[08:39:57] <stashbot>	 T331264: Error 2006 from GrowthExperiments\UserImpact\ComputedUserImpactLookup::getEditData - https://phabricator.wikimedia.org/T331264
[08:40:01] <taavi>	 all done
[08:40:05] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2141.codfw.wmnet with reason: Maintenance
[08:40:18] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2141.codfw.wmnet with reason: Maintenance
[08:41:43] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10cmooney)
[08:42:03] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10cmooney) a:03cmooney
[08:42:32] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2141.codfw.wmnet with reason: Maintenance
[08:42:34] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2141.codfw.wmnet with reason: Maintenance
[08:43:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110', diff saved to https://phabricator.wikimedia.org/P45647 and previous config saved to /var/cache/conftool/dbconfig/20230309-084326-marostegui.json
[08:43:40] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance
[08:43:54] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance
[08:44:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2180 (T329684)', diff saved to https://phabricator.wikimedia.org/P45648 and previous config saved to /var/cache/conftool/dbconfig/20230309-084359-marostegui.json
[08:44:08] <stashbot>	 T329684: Drop default value from cuc_actor and cuc_comment_id on wmf wikis - https://phabricator.wikimedia.org/T329684
[08:44:11] <jnuche>	 hi, I was AFK, sorry
[08:44:19] <jnuche>	 taavi: thanks for taking care of the deployment
[08:45:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P45649 and previous config saved to /var/cache/conftool/dbconfig/20230309-084543-root.json
[08:46:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T329203)', diff saved to https://phabricator.wikimedia.org/P45650 and previous config saved to /var/cache/conftool/dbconfig/20230309-084601-marostegui.json
[08:46:07] <stashbot>	 T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[08:46:39] <kostajh>	 thanks taavi 
[08:51:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1011.eqiad.wmnet with OS bullseye
[08:51:59] <wikibugs>	 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1011.eqiad.wmnet with OS bullseye
[08:52:53] <wikibugs>	 10SRE, 10serviceops: mw2420-mw2451 service implementation tracking - https://phabricator.wikimedia.org/T326363 (10akosiaris) Playing around with data using the following constraints:  * We are 40%+ skewed towards using row A across all mw2* hosts (this isn't easily fixable right now) * I can only easily mess a...
[08:54:18] <marostegui>	 !log Deploy schema change on s6 codfw dbmaint T329684
[08:54:20] <marostegui>	 !log Deploy schema change on s5 codfw dbmaint T329684
[08:54:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:22] <marostegui>	 !log Deploy schema change on s2 codfw dbmaint T329684
[08:54:24] <stashbot>	 T329684: Drop default value from cuc_actor and cuc_comment_id on wmf wikis - https://phabricator.wikimedia.org/T329684
[08:54:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110', diff saved to https://phabricator.wikimedia.org/P45651 and previous config saved to /var/cache/conftool/dbconfig/20230309-085832-marostegui.json
[08:59:16] <wikibugs>	 (03PS7) 10Jelto: gitlab_runner: add optional docker registry proxy to runners [puppet] - 10https://gerrit.wikimedia.org/r/894100 (https://phabricator.wikimedia.org/T329679)
[09:00:04] <jouncebot>	 jeena and jnuche: gettimeofday() says it's time for MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230309T0900)
[09:00:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P45652 and previous config saved to /var/cache/conftool/dbconfig/20230309-090048-root.json
[09:01:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P45653 and previous config saved to /var/cache/conftool/dbconfig/20230309-090107-marostegui.json
[09:06:26] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1011.eqiad.wmnet with reason: host reimage
[09:09:38] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1011.eqiad.wmnet with reason: host reimage
[09:11:01] <wikibugs>	 (03PS6) 10Giuseppe Lavagetto: Add check_dns_state to service.Service [software/spicerack] - 10https://gerrit.wikimedia.org/r/894655
[09:11:24] <wikibugs>	 (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40047/console" [puppet] - 10https://gerrit.wikimedia.org/r/894100 (https://phabricator.wikimedia.org/T329679) (owner: 10Jelto)
[09:12:17] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on 10 hosts with reason: cr1-codfw linecard 1/0 reset
[09:12:18] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 0:30:00 on 10 hosts with reason: cr1-codfw linecard 1/0 reset
[09:13:25] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on 6 hosts with reason: cr1-codfw linecard 1/0 reset
[09:13:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110 (T329260)', diff saved to https://phabricator.wikimedia.org/P45654 and previous config saved to /var/cache/conftool/dbconfig/20230309-091338-marostegui.json
[09:13:41] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2119.codfw.wmnet with reason: Maintenance
[09:13:42] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 6 hosts with reason: cr1-codfw linecard 1/0 reset
[09:13:44] <stashbot>	 T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260
[09:13:54] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2119.codfw.wmnet with reason: Maintenance
[09:14:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2119 (T329260)', diff saved to https://phabricator.wikimedia.org/P45655 and previous config saved to /var/cache/conftool/dbconfig/20230309-091400-marostegui.json
[09:14:09] <wikibugs>	 10SRE, 10serviceops: mw2420-mw2451 service implementation tracking - https://phabricator.wikimedia.org/T326363 (10akosiaris)
[09:15:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P45656 and previous config saved to /var/cache/conftool/dbconfig/20230309-091552-root.json
[09:16:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P45657 and previous config saved to /var/cache/conftool/dbconfig/20230309-091613-marostegui.json
[09:17:14] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "Not an expert but LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/895877 (https://phabricator.wikimedia.org/T284555) (owner: 10BCornwall)
[09:17:23] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] rsyslog: Remove centrallog1001 as TLS rsyslog destination [puppet] - 10https://gerrit.wikimedia.org/r/890884 (https://phabricator.wikimedia.org/T328803) (owner: 10Andrea Denisse)
[09:17:42] <wikibugs>	 (03PS1) 10Elukey: ml-services: update docker images to roll out a fix for rev-id matching [deployment-charts] - 10https://gerrit.wikimedia.org/r/896022
[09:17:44] <wikibugs>	 (03CR) 10Filippo Giunchedi: centrallog: Remove centrallog1002 from the kafka-jumbo allow list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/895898 (https://phabricator.wikimedia.org/T328803) (owner: 10Andrea Denisse)
[09:17:58] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] centrallog: Add centrallog1002 as the kafkatee active host [puppet] - 10https://gerrit.wikimedia.org/r/895902 (https://phabricator.wikimedia.org/T328803) (owner: 10Andrea Denisse)
[09:18:21] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] dispatch/grafana: retry GETs too on LDAP sync [puppet] - 10https://gerrit.wikimedia.org/r/895719 (owner: 10Filippo Giunchedi)
[09:18:45] <wikibugs>	 (03PS8) 10Jelto: gitlab_runner: add optional docker registry proxy to runners [puppet] - 10https://gerrit.wikimedia.org/r/894100 (https://phabricator.wikimedia.org/T329679)
[09:18:51] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] alertmanager: highlight 'source' label [puppet] - 10https://gerrit.wikimedia.org/r/895713 (owner: 10Filippo Giunchedi)
[09:19:18] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C: 03+1] ml-services: update docker images to roll out a fix for rev-id matching [deployment-charts] - 10https://gerrit.wikimedia.org/r/896022 (owner: 10Elukey)
[09:19:44] <marostegui>	 !log Deploy schema change on s7 codfw dbmaint T329684
[09:19:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:49] <stashbot>	 T329684: Drop default value from cuc_actor and cuc_comment_id on wmf wikis - https://phabricator.wikimedia.org/T329684
[09:20:07] <wikibugs>	 (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40048/console" [puppet] - 10https://gerrit.wikimedia.org/r/894100 (https://phabricator.wikimedia.org/T329679) (owner: 10Jelto)
[09:20:17] <logmsgbot>	 !log jnuche@deploy2002 Installing scap version "latest" for 553 hosts
[09:21:26] <logmsgbot>	 !log jnuche@deploy2002 Installation of scap version "latest" completed for 553 hosts
[09:23:39] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM! thank you" [puppet] - 10https://gerrit.wikimedia.org/r/895821 (owner: 10Jbond)
[09:23:40] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1011.eqiad.wmnet with OS bullseye
[09:23:45] <wikibugs>	 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1011.eqiad.wmnet with OS bullseye completed: - ganeti1011 (**PASS**)   - Downtimed on...
[09:25:01] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on pfw3-codfw with reason: cr1-codfw linecard 1/0 reset
[09:25:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119 (T329260)', diff saved to https://phabricator.wikimedia.org/P45658 and previous config saved to /var/cache/conftool/dbconfig/20230309-092502-marostegui.json
[09:25:11] <stashbot>	 T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260
[09:25:16] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on pfw3-codfw with reason: cr1-codfw linecard 1/0 reset
[09:27:45] <topranks>	 !log disabling Transit cct on cr1-codfw xe-1/0/1:0 (T331527)
[09:27:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:29] <elukey>	 !log delete old/unused ML-related docker images from the registry - T331513
[09:29:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:34] <stashbot>	 T331513: Delete old ml-related docker images that are deprecated - https://phabricator.wikimedia.org/T331513
[09:29:51] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 142, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:30:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P45659 and previous config saved to /var/cache/conftool/dbconfig/20230309-093057-root.json
[09:31:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T329203)', diff saved to https://phabricator.wikimedia.org/P45660 and previous config saved to /var/cache/conftool/dbconfig/20230309-093120-marostegui.json
[09:31:22] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2139.codfw.wmnet with reason: Maintenance
[09:31:25] <stashbot>	 T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[09:31:35] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2139.codfw.wmnet with reason: Maintenance
[09:32:04] <topranks>	 ^^^ cr2-codfw above is part of my work, I overlooked the downtime on that one
[09:32:24] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on cr2-codfw,cr2-codfw IPv6 with reason: cr1-codfw linecard 1/0 reset
[09:32:39] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on cr2-codfw,cr2-codfw IPv6 with reason: cr1-codfw linecard 1/0 reset
[09:33:21] <topranks>	 !log resetting Pic 1/0 on cr1-codfw 
[09:33:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:13] <wikibugs>	 (03PS1) 10MVernon: admin: update sbassett ssh key [puppet] - 10https://gerrit.wikimedia.org/r/896024 (https://phabricator.wikimedia.org/T331554)
[09:35:27] <icinga-wm>	 RECOVERY - Juniper alarms on cr1-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[09:35:29] <wikibugs>	 (03CR) 10MVernon: "Please confirm your ssh key is correct and then +1, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/896024 (https://phabricator.wikimedia.org/T331554) (owner: 10MVernon)
[09:35:59] <wikibugs>	 (03CR) 10Btullis: [C: 04-1] Specify docker image and version consistently (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/895842 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis)
[09:36:09] <wikibugs>	 (03CR) 10Muehlenhoff: mod_auth_cas: add logout script for mod_auth_cas (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/695255 (owner: 10Jbond)
[09:36:26] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review, 10SecTeam-Processed, 10Security: New production ssh key for sbassett - https://phabricator.wikimedia.org/T331554 (10MatthewVernon) @sbassett I've opened a CR to update your ssh key - if you can confirm it's correct and +1 the CR, I'll merge it.
[09:40:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119', diff saved to https://phabricator.wikimedia.org/P45661 and previous config saved to /var/cache/conftool/dbconfig/20230309-094008-marostegui.json
[09:40:33] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:40:57] <wikibugs>	 (03CR) 10MVernon: [C: 03+1] "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/895910 (https://phabricator.wikimedia.org/T330847) (owner: 10Marostegui)
[09:41:34] <wikibugs>	 (03CR) 10Marostegui: [C: 04-2] mariadb: Promote db1176 to m5 master (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/895910 (https://phabricator.wikimedia.org/T330847) (owner: 10Marostegui)
[09:46:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P45662 and previous config saved to /var/cache/conftool/dbconfig/20230309-094602-root.json
[09:47:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[09:48:37] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.remove-downtime for 9 hosts
[09:48:40] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 9 hosts
[09:48:44] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] codesearch: Change systemd Requires= to BindsTo= [puppet] - 10https://gerrit.wikimedia.org/r/895884 (https://phabricator.wikimedia.org/T284555) (owner: 10BCornwall)
[09:49:41] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1011.eqiad.wmnet
[09:50:51] <wikibugs>	 (03CR) 10Jelto: [V: 03+1] gitlab_runner: add optional docker registry proxy to runners (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/894100 (https://phabricator.wikimedia.org/T329679) (owner: 10Jelto)
[09:52:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[09:53:07] <marostegui>	 !log Deploy schema change on s8 codfw dbmaint T329684
[09:53:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:12] <stashbot>	 T329684: Drop default value from cuc_actor and cuc_comment_id on wmf wikis - https://phabricator.wikimedia.org/T329684
[09:54:19] <wikibugs>	 (03PS1) 10MVernon: admin: dmartin now needs analytics-privatedata + krb [puppet] - 10https://gerrit.wikimedia.org/r/896046 (https://phabricator.wikimedia.org/T331500)
[09:55:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119', diff saved to https://phabricator.wikimedia.org/P45663 and previous config saved to /var/cache/conftool/dbconfig/20230309-095514-marostegui.json
[09:55:38] <marostegui>	 !log Deploy schema change on s4 codfw dbmaint T329684
[09:55:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:46] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1011.eqiad.wmnet
[09:57:35] <wikibugs>	 (03PS1) 10Nicolas Fraison: hadoop-hdfs: Add alert on FSImage age [alerts] - 10https://gerrit.wikimedia.org/r/896049 (https://phabricator.wikimedia.org/T331310)
[09:59:56] <wikibugs>	 (03CR) 10Nicolas Fraison: Specify docker image and version consistently (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/895842 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis)
[10:00:57] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] profile::calico::kubernetes: set new istio-cni defaults [puppet] - 10https://gerrit.wikimedia.org/r/895911 (https://phabricator.wikimedia.org/T328291) (owner: 10Elukey)
[10:01:21] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] ml-services: update docker images to roll out a fix for rev-id matching [deployment-charts] - 10https://gerrit.wikimedia.org/r/896022 (owner: 10Elukey)
[10:01:35] <topranks>	 !log commencing work to drain cr2-codfw ports on card 1/0 (T331601)
[10:01:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:05:16] <wikibugs>	 (03CR) 10Nicolas Fraison: Specify docker image and version consistently (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/895842 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis)
[10:05:41] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2149.codfw.wmnet with reason: Maintenance
[10:06:05] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2149.codfw.wmnet with reason: Maintenance
[10:06:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2149 (T329203)', diff saved to https://phabricator.wikimedia.org/P45664 and previous config saved to /var/cache/conftool/dbconfig/20230309-100611-marostegui.json
[10:06:16] <stashbot>	 T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[10:06:46] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] ml-services: update docker images to roll out a fix for rev-id matching [deployment-charts] - 10https://gerrit.wikimedia.org/r/896022 (owner: 10Elukey)
[10:10:18] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[10:10:19] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1011.eqiad.wmnet to cluster eqiad and group C
[10:10:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119 (T329260)', diff saved to https://phabricator.wikimedia.org/P45665 and previous config saved to /var/cache/conftool/dbconfig/20230309-101020-marostegui.json
[10:10:23] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2136.codfw.wmnet with reason: Maintenance
[10:10:25] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1011.eqiad.wmnet to cluster eqiad and group C
[10:10:30] <stashbot>	 T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260
[10:10:31] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[10:10:36] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2136.codfw.wmnet with reason: Maintenance
[10:10:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2136 (T329260)', diff saved to https://phabricator.wikimedia.org/P45666 and previous config saved to /var/cache/conftool/dbconfig/20230309-101042-marostegui.json
[10:10:55] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[10:11:06] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[10:11:20] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1011.eqiad.wmnet
[10:11:22] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[10:11:35] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[10:11:49] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[10:12:00] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
[10:13:04] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
[10:13:09] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
[10:13:19] <elukey>	 sorry for the spam, broad deployment of ml model servers :)
[10:13:32] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[10:13:35] <wikibugs>	 10SRE, 10Observability-Logging: Ingest webrequest sampled 1000 into logstash - https://phabricator.wikimedia.org/T301110 (10fgiunchedi) This is still valid, though nowadays the implementation will be much simpler: we can ingest `webrequest_sampled` directly from Kafka!
[10:13:52] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[10:15:38] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for David Martin - https://phabricator.wikimedia.org/T331500 (10MatthewVernon)
[10:15:57] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, minor style nit inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/894655 (owner: 10Giuseppe Lavagetto)
[10:16:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:17:58] <ottomata>	 jeena / jnuche o/ is train deploy clear?  I'd like to deploy some no op config changes
[10:19:03] <jnuche>	 ottomata: you can go ahead, train will happen today in US time
[10:19:18] <wikibugs>	 (03Abandoned) 10Ottomata: WIP - install pyflink deps with pip [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/883278 (https://phabricator.wikimedia.org/T327494) (owner: 10Ottomata)
[10:19:25] <ottomata>	 okay, ty
[10:19:28] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[10:19:45] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[10:19:50] <wikibugs>	 (03PS3) 10Ottomata: ext-EventStreamConfig.php - wgEventStreams lives here [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895831 (https://phabricator.wikimedia.org/T308932)
[10:19:55] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] ext-EventStreamConfig.php - wgEventStreams lives here [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895831 (https://phabricator.wikimedia.org/T308932) (owner: 10Ottomata)
[10:20:17] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1011.eqiad.wmnet
[10:20:42] <wikibugs>	 (03Merged) 10jenkins-bot: ext-EventStreamConfig.php - wgEventStreams lives here [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895831 (https://phabricator.wikimedia.org/T308932) (owner: 10Ottomata)
[10:20:54] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1011.eqiad.wmnet to cluster eqiad and group C
[10:20:59] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] "cc: Jesse as aux will probably want to adapt this, although they currently don't have any tainted nodes." [deployment-charts] - 10https://gerrit.wikimedia.org/r/895748 (owner: 10Alexandros Kosiaris)
[10:21:18] <wikibugs>	 (03Abandoned) 10Arturo Borrero Gonzalez: toolforge: wmcs-k8s-get-cert.sh: fix inverted logic [puppet] - 10https://gerrit.wikimedia.org/r/895224 (owner: 10Arturo Borrero Gonzalez)
[10:21:25] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[10:21:32] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[10:21:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (PUT deployments) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:22:02] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on 9 hosts with reason: cr2-codfw linecard 1/0 reset
[10:22:13] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[10:22:18] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[10:22:21] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 9 hosts with reason: cr2-codfw linecard 1/0 reset
[10:22:45] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[10:22:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T329260)', diff saved to https://phabricator.wikimedia.org/P45667 and previous config saved to /var/cache/conftool/dbconfig/20230309-102247-marostegui.json
[10:22:52] <stashbot>	 T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260
[10:23:30] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[10:24:35] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[10:24:55] <wikibugs>	 (03CR) 10Ladsgroup: change_cuc_actor_T329684.py: New schema change (031 comment) [software/schema-changes] - 10https://gerrit.wikimedia.org/r/895906 (https://phabricator.wikimedia.org/T329684) (owner: 10Marostegui)
[10:25:07] <wikibugs>	 (03PS4) 10Ottomata: wgEventStreams etc. - Remove duplicate configs after refactor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895832 (https://phabricator.wikimedia.org/T308932)
[10:25:17] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[10:25:18] <wikibugs>	 (03CR) 10Ladsgroup: "I'll be afk for most of the day, so if this is fixed, it has my virtual +1." [software/schema-changes] - 10https://gerrit.wikimedia.org/r/895906 (https://phabricator.wikimedia.org/T329684) (owner: 10Marostegui)
[10:26:40] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[10:26:55] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[10:27:13] <jinxer-wm>	 (KubernetesAPILatency) firing: (4) High Kubernetes API latency (PUT deployments) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:27:50] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti1011.eqiad.wmnet to cluster eqiad and group C
[10:28:37] <wikibugs>	 (03PS1) 10Majavah: cr-cloud: permit toolsdb return traffic to cloudcontrols [homer/public] - 10https://gerrit.wikimedia.org/r/896051 (https://phabricator.wikimedia.org/T303663)
[10:29:13] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:29:36] <logmsgbot>	 !log otto@deploy2002 Synchronized wmf-config/ext-EventStreamConfig.php: Step 1a: ext-EventStreamConfig.php - wgEventStreams lives here - T308932 (duration: 06m 43s)
[10:29:41] <stashbot>	 T308932: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022)    - https://phabricator.wikimedia.org/T308932
[10:30:06] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "Thanks for the patch. Please hold this change until we can clarify the setup." [homer/public] - 10https://gerrit.wikimedia.org/r/896051 (https://phabricator.wikimedia.org/T303663) (owner: 10Majavah)
[10:32:13] <jinxer-wm>	 (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:32:36] <logmsgbot>	 !log hashar@deploy2002 Started deploy [integration/docroot@095a329]: Add 'Test coverage' link for MW core and a few others
[10:32:44] <logmsgbot>	 !log hashar@deploy2002 Finished deploy [integration/docroot@095a329]: Add 'Test coverage' link for MW core and a few others (duration: 00m 08s)
[10:34:47] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:35:28] <jinxer-wm>	 (KubernetesAPILatency) firing: (4) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:37:13] <jinxer-wm>	 (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:37:51] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudgw: don't accept NEW connections wan -> virt to internal private addresses [puppet] - 10https://gerrit.wikimedia.org/r/896052 (https://phabricator.wikimedia.org/T272585)
[10:37:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P45668 and previous config saved to /var/cache/conftool/dbconfig/20230309-103753-marostegui.json
[10:39:36] <logmsgbot>	 !log otto@deploy2002 Synchronized multiversion/MWConfigCacheGenerator.php: Step 1b: MWConfigCacheGenerator.php - load ext-EventStreamConfig.php - T308932 (duration: 06m 23s)
[10:39:41] <stashbot>	 T308932: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022)    - https://phabricator.wikimedia.org/T308932
[10:39:42] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] cloudgw: don't accept NEW connections wan -> virt to internal private addresses [puppet] - 10https://gerrit.wikimedia.org/r/896052 (https://phabricator.wikimedia.org/T272585) (owner: 10Arturo Borrero Gonzalez)
[10:40:05] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] wgEventStreams etc. - Remove duplicate configs after refactor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895832 (https://phabricator.wikimedia.org/T308932) (owner: 10Ottomata)
[10:40:49] <wikibugs>	 (03Merged) 10jenkins-bot: wgEventStreams etc. - Remove duplicate configs after refactor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895832 (https://phabricator.wikimedia.org/T308932) (owner: 10Ottomata)
[10:42:13] <jinxer-wm>	 (KubernetesAPILatency) firing: (4) High Kubernetes API latency (POST pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:42:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T329203)', diff saved to https://phabricator.wikimedia.org/P45669 and previous config saved to /var/cache/conftool/dbconfig/20230309-104220-marostegui.json
[10:42:26] <stashbot>	 T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[10:42:27] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] profile::calico::kubernetes: set new istio-cni defaults [puppet] - 10https://gerrit.wikimedia.org/r/895911 (https://phabricator.wikimedia.org/T328291) (owner: 10Elukey)
[10:43:18] <wikibugs>	 (03PS1) 10Nicolas Fraison: spark-operator: rely on exec entrypoint instead of shell one [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/896053
[10:44:30] <wikibugs>	 (03CR) 10Elukey: [V: 03+1 C: 03+2] profile::calico::kubernetes: set new istio-cni defaults [puppet] - 10https://gerrit.wikimedia.org/r/895911 (https://phabricator.wikimedia.org/T328291) (owner: 10Elukey)
[10:44:56] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 0:20:00 on 9 hosts with reason: cr2-codfw linecard 1/0 reset
[10:45:04] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:20:00 on 9 hosts with reason: cr2-codfw linecard 1/0 reset
[10:45:28] <jinxer-wm>	 (KubernetesAPILatency) resolved: (5) High Kubernetes API latency (POST pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:47:21] <topranks>	 !log Resetting PIC in slot 1/0 on cr2-codfw T331527
[10:47:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:47:49] <wikibugs>	 (03PS2) 10Nicolas Fraison: spark-operator: rely on exec entrypoint instead of shell one [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/896053 (https://phabricator.wikimedia.org/T318926)
[10:50:55] <logmsgbot>	 !log otto@deploy2002 Synchronized wmf-config/ext-EventLogging.php: Step 2a: ext-EventLogging.php - remove duplicate configs - T308932 (duration: 06m 32s)
[10:50:59] <stashbot>	 T308932: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022)    - https://phabricator.wikimedia.org/T308932
[10:52:13] <jinxer-wm>	 (KubernetesAPILatency) firing: (4) High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:53:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P45670 and previous config saved to /var/cache/conftool/dbconfig/20230309-105259-marostegui.json
[10:53:27] <wikibugs>	 (03PS1) 10Nicolas Fraison: hadoop::hdfs: remove nrpe check file age on FSImage [puppet] - 10https://gerrit.wikimedia.org/r/896057 (https://phabricator.wikimedia.org/T331310)
[10:53:29] <wikibugs>	 (03PS1) 10Nicolas Fraison: hadoop:hdfs: fully remove FSImage nrpe check file age alert [puppet] - 10https://gerrit.wikimedia.org/r/896058 (https://phabricator.wikimedia.org/T331310)
[10:53:49] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] hadoop::hdfs: remove nrpe check file age on FSImage [puppet] - 10https://gerrit.wikimedia.org/r/896057 (https://phabricator.wikimedia.org/T331310) (owner: 10Nicolas Fraison)
[10:54:17] <wikibugs>	 (03CR) 10Btullis: "I believe that we need to update the changelog as well, otherwise the build process will not know that this version needs to be updated." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/896053 (https://phabricator.wikimedia.org/T318926) (owner: 10Nicolas Fraison)
[10:55:28] <jinxer-wm>	 (KubernetesAPILatency) resolved: (5) High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:57:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P45671 and previous config saved to /var/cache/conftool/dbconfig/20230309-105726-marostegui.json
[10:57:30] <jinxer-wm>	 (Primary outbound port utilisation over 80%  #page) firing: Alert for device cr2-codfw.wikimedia.org - Primary outbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[10:57:32] <wikibugs>	 (03PS3) 10Nicolas Fraison: spark-operator: rely on exec entrypoint instead of shell one [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/896053 (https://phabricator.wikimedia.org/T318926)
[10:57:37] <wikibugs>	 (03CR) 10Nicolas Fraison: spark-operator: rely on exec entrypoint instead of shell one (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/896053 (https://phabricator.wikimedia.org/T318926) (owner: 10Nicolas Fraison)
[10:58:06] <marostegui>	 woot
[10:58:35] <marostegui>	 topranks: you around?
[10:58:44] <marostegui>	 !incidents
[10:58:45] <sirenbot>	 3467 (UNACKED)  Primary outbound port utilisation over 80%  (paged) global (cr2-codfw.wikimedia.org)
[10:58:45] <sirenbot>	 3466 (RESOLVED)  SessionStoreErrorRateHigh (eqiad)
[10:58:55] <marostegui>	 !ack 3467
[10:58:55] <sirenbot>	 3467 (ACKED)  Primary outbound port utilisation over 80%  (paged) global (cr2-codfw.wikimedia.org)
[10:59:06] <wikibugs>	 (03CR) 10Btullis: "See here for the build process docs:" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/896053 (https://phabricator.wikimedia.org/T318926) (owner: 10Nicolas Fraison)
[10:59:07] <topranks>	 marostegui: I am yes 
[10:59:11] * topranks looking 
[10:59:12] <marostegui>	 topranks: is that related to your maintenance?
[10:59:36] <topranks>	 likely related to my maintenance, which I've just finished, it's a high utilization alert 
[10:59:39] <topranks>	 checking it 
[11:00:04] <marostegui>	 ok let me know if you want me to resolve it 
[11:00:05] <jouncebot>	 mvolz: How many deployers does it take to do Services – Citoid / Zotero deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230309T1100).
[11:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230309T1100)
[11:00:25] <logmsgbot>	 !log otto@deploy2002 Synchronized wmf-config/InitialiseSettings.php: Step 2b: InitialiseSettings.php - remove duplicate configs - T308932 (duration: 06m 37s)
[11:00:30] <stashbot>	 T308932: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022)    - https://phabricator.wikimedia.org/T308932
[11:00:37] <wikibugs>	 (03PS2) 10Nicolas Fraison: hadoop::hdfs: remove nrpe check file age on FSImage [puppet] - 10https://gerrit.wikimedia.org/r/896057 (https://phabricator.wikimedia.org/T331310)
[11:00:38] <topranks>	 it's odd host should be downtimed
[11:00:39] <wikibugs>	 (03PS2) 10Nicolas Fraison: hadoop:hdfs: fully remove FSImage nrpe check file age alert [puppet] - 10https://gerrit.wikimedia.org/r/896058 (https://phabricator.wikimedia.org/T331310)
[11:00:59] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] hadoop::hdfs: remove nrpe check file age on FSImage [puppet] - 10https://gerrit.wikimedia.org/r/896057 (https://phabricator.wikimedia.org/T331310) (owner: 10Nicolas Fraison)
[11:01:39] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.remove-downtime for 9 hosts
[11:01:42] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 9 hosts
[11:02:17] <marostegui>	 thanks topranks 
[11:02:28] <topranks>	 marostegui: I resolved, not sure why the downtime didn't block it but wasn't an issue either way 
[11:02:30] <jinxer-wm>	 (Primary outbound port utilisation over 80%  #page) resolved: Device cr2-codfw.wikimedia.org recovered from Primary outbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[11:02:39] <topranks>	 just expected high use on the remaining links between the two CRs when one was down 
[11:02:41] <topranks>	 both back up now 
[11:02:47] <marostegui>	 thanks :)
[11:02:50] <moritzm>	 ack, thx
[11:02:54] <topranks>	 apologies for the noise 
[11:03:03] <marostegui>	 np
[11:05:09] <wikibugs>	 (03PS2) 10Btullis: Update the spark-operator chart with consistent image versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/895842 (https://phabricator.wikimedia.org/T318926)
[11:07:06] <wikibugs>	 (03CR) 10Btullis: Update the spark-operator chart with consistent image versions (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/895842 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis)
[11:08:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T329260)', diff saved to https://phabricator.wikimedia.org/P45672 and previous config saved to /var/cache/conftool/dbconfig/20230309-110806-marostegui.json
[11:08:08] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2137.codfw.wmnet with reason: Maintenance
[11:08:11] <stashbot>	 T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260
[11:08:21] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2137.codfw.wmnet with reason: Maintenance
[11:08:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2137:3314 (T329260)', diff saved to https://phabricator.wikimedia.org/P45673 and previous config saved to /var/cache/conftool/dbconfig/20230309-110827-marostegui.json
[11:08:33] <wikibugs>	 10SRE, 10Patch-For-Review: Prepare our base system layer for Debian 11/bullseye - https://phabricator.wikimedia.org/T275873 (10MoritzMuehlenhoff)
[11:08:35] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: Prepare puppet master infrastructure for bullseye - https://phabricator.wikimedia.org/T285086 (10MoritzMuehlenhoff) 05Open→03Declined This task got replaced/superseded by https://phabricator.wikimedia.org/T330490
[11:08:49] <wikibugs>	 (03PS3) 10Btullis: Update the spark-operator chart with consistent image versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/895842 (https://phabricator.wikimedia.org/T318926)
[11:09:48] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "Adding jayme and otto as reviewers for good measure." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/896053 (https://phabricator.wikimedia.org/T318926) (owner: 10Nicolas Fraison)
[11:10:54] <wikibugs>	 (03CR) 10Btullis: "Bumped version again as a result of this change: https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/896053" [deployment-charts] - 10https://gerrit.wikimedia.org/r/895842 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis)
[11:12:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P45674 and previous config saved to /var/cache/conftool/dbconfig/20230309-111233-marostegui.json
[11:14:05] <wikibugs>	 10SRE, 10Machine-Learning-Team, 10serviceops-radar, 10artificial-intelligence: New Service Request 'open_nsfw' - https://phabricator.wikimedia.org/T250110 (10akosiaris) I am tentatively removing #service-deployment-requests as I don't see how #serviceops (the owner of that tag) has anything to do with this...
[11:14:24] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10cmooney)
[11:14:36] <wikibugs>	 10SRE, 10serviceops: mw2420-mw2451 service implementation tracking - https://phabricator.wikimedia.org/T326363 (10Clement_Goubert) That looks a lot better balanced even without touching row A skew, we wouldn't dip below 50% capacity in any cluster if we lose row A (which was the concern for jobrunners). We're...
[11:16:34] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:16:52] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:17:10] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: cloudgw: don't accept NEW connections wan -> virt to internal private addresses [puppet] - 10https://gerrit.wikimedia.org/r/896052 (https://phabricator.wikimedia.org/T272585)
[11:18:49] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40049/console" [puppet] - 10https://gerrit.wikimedia.org/r/894646 (owner: 10Jbond)
[11:18:53] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "DONT MERGE. This needs live testing before merging." [puppet] - 10https://gerrit.wikimedia.org/r/896052 (https://phabricator.wikimedia.org/T272585) (owner: 10Arturo Borrero Gonzalez)
[11:20:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314 (T329260)', diff saved to https://phabricator.wikimedia.org/P45675 and previous config saved to /var/cache/conftool/dbconfig/20230309-112019-marostegui.json
[11:20:24] <stashbot>	 T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260
[11:23:01] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] P:rsyslog: manage /etc/logrotate.d/rsyslog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/894646 (owner: 10Jbond)
[11:24:48] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:25:06] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10cmooney) >>! In T327919#8664016, @aborrero wrote: > Please let me know if there is something I can do t...
[11:26:04] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.207 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:26:22] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49709 bytes in 0.157 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:27:39] <wikibugs>	 10SRE, 10Machine-Learning-Team, 10serviceops, 10Language-Team (Language-2023-January-March), 10Service-deployment-requests: New Service Deployment Request: NNLB-200 for machine translation - https://phabricator.wikimedia.org/T329971 (10akosiaris) I 've transformed (roughly) this to a #service-deployment-...
[11:27:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T329203)', diff saved to https://phabricator.wikimedia.org/P45676 and previous config saved to /var/cache/conftool/dbconfig/20230309-112739-marostegui.json
[11:27:41] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2156.codfw.wmnet with reason: Maintenance
[11:27:45] <stashbot>	 T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[11:27:54] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:27:55] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2156.codfw.wmnet with reason: Maintenance
[11:27:56] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance
[11:27:59] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance
[11:28:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2156 (T329203)', diff saved to https://phabricator.wikimedia.org/P45677 and previous config saved to /var/cache/conftool/dbconfig/20230309-112804-marostegui.json
[11:28:32] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Migrate apt repository to bullseye or bookworm - https://phabricator.wikimedia.org/T331613 (10MoritzMuehlenhoff)
[11:28:58] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Migrate apt repository to bullseye or bookworm - https://phabricator.wikimedia.org/T331613 (10MoritzMuehlenhoff)
[11:30:58] <wikibugs>	 (03PS1) 10Muehlenhoff: buster updates [puppet] - 10https://gerrit.wikimedia.org/r/896060
[11:33:04] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] buster updates [puppet] - 10https://gerrit.wikimedia.org/r/896060 (owner: 10Muehlenhoff)
[11:33:52] <wikibugs>	 (03PS5) 10Jbond: P:rsyslog: manage /etc/logrotate.d/rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/894646
[11:35:09] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] P:rsyslog: manage /etc/logrotate.d/rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/894646 (owner: 10Jbond)
[11:35:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314', diff saved to https://phabricator.wikimedia.org/P45678 and previous config saved to /var/cache/conftool/dbconfig/20230309-113525-marostegui.json
[11:37:39] <wikibugs>	 (03PS6) 10Jbond: P:rsyslog: manage /etc/logrotate.d/rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/894646
[11:37:41] <wikibugs>	 (03PS7) 10Jbond: pki: Add blackbox tests for pki services [puppet] - 10https://gerrit.wikimedia.org/r/895757
[11:38:00] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] P:blackbox_exporter: update client auth checks to use local certs [puppet] - 10https://gerrit.wikimedia.org/r/895821 (owner: 10Jbond)
[11:38:12] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] pki: Add blackbox tests for pki services [puppet] - 10https://gerrit.wikimedia.org/r/895757 (owner: 10Jbond)
[11:39:29] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] P:rsyslog: manage /etc/logrotate.d/rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/894646 (owner: 10Jbond)
[11:39:39] <wikibugs>	 (03PS3) 10Marostegui: change_cuc_actor_T329684.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/895906 (https://phabricator.wikimedia.org/T329684)
[11:40:22] <moritzm>	 !log installing git security updates
[11:40:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:41:28] <wikibugs>	 (03CR) 10Marostegui: change_cuc_actor_T329684.py: New schema change (031 comment) [software/schema-changes] - 10https://gerrit.wikimedia.org/r/895906 (https://phabricator.wikimedia.org/T329684) (owner: 10Marostegui)
[11:41:36] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] change_cuc_actor_T329684.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/895906 (https://phabricator.wikimedia.org/T329684) (owner: 10Marostegui)
[11:41:58] <wikibugs>	 (03Merged) 10jenkins-bot: change_cuc_actor_T329684.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/895906 (https://phabricator.wikimedia.org/T329684) (owner: 10Marostegui)
[11:42:25] <wikibugs>	 (03CR) 10Jbond: "LGTM but still needs manage approval" [puppet] - 10https://gerrit.wikimedia.org/r/896046 (https://phabricator.wikimedia.org/T331500) (owner: 10MVernon)
[11:42:56] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for David Martin - https://phabricator.wikimedia.org/T331500 (10jbond)
[11:43:30] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance
[11:43:32] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance
[11:43:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2180 (T329684)', diff saved to https://phabricator.wikimedia.org/P45679 and previous config saved to /var/cache/conftool/dbconfig/20230309-114338-marostegui.json
[11:43:43] <stashbot>	 T329684: Drop default value from cuc_actor and cuc_comment_id on wmf wikis - https://phabricator.wikimedia.org/T329684
[11:44:18] <wikibugs>	 (03PS7) 10Jbond: P:rsyslog: manage /etc/logrotate.d/rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/894646
[11:44:41] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2097.codfw.wmnet with reason: Maintenance
[11:44:44] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2097.codfw.wmnet with reason: Maintenance
[11:45:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P45680 and previous config saved to /var/cache/conftool/dbconfig/20230309-114500-root.json
[11:45:36] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40052/console" [puppet] - 10https://gerrit.wikimedia.org/r/894646 (owner: 10Jbond)
[11:46:45] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10aborrero) >>! In T327919#8679314, @cmooney wrote: >>>! In T327919#8664016, @aborrero wrote: >> Please l...
[11:47:43] <marostegui>	 !log Deploy schema change on s1 codfw dbmaint T329684
[11:47:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:50:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314', diff saved to https://phabricator.wikimedia.org/P45681 and previous config saved to /var/cache/conftool/dbconfig/20230309-115031-marostegui.json
[11:51:10] <jinxer-wm>	 (ProbeDown) firing: (13) Service pki1001:443 has failed probes (http_aux_front_proxy_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#pki1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:52:14] <wikibugs>	 (03PS3) 10Btullis: Add an entry in the service catalog for the aqs service running in codfw [puppet] - 10https://gerrit.wikimedia.org/r/894017 (https://phabricator.wikimedia.org/T331115)
[11:54:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T329203)', diff saved to https://phabricator.wikimedia.org/P45682 and previous config saved to /var/cache/conftool/dbconfig/20230309-115445-marostegui.json
[11:54:49] <jinxer-wm>	 (ProbeDown) firing: (34) Service pki1001:443 has failed probes (http_aux_front_proxy_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#pki1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:54:51] <stashbot>	 T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[11:56:46] <wikibugs>	 (03PS3) 10Btullis: Add forward and reverse entries for aqs.svc.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/894024 (https://phabricator.wikimedia.org/T331115)
[11:58:00] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Add forward and reverse entries for aqs.svc.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/894024 (https://phabricator.wikimedia.org/T331115) (owner: 10Btullis)
[12:00:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P45683 and previous config saved to /var/cache/conftool/dbconfig/20230309-120005-root.json
[12:01:41] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for David Martin - https://phabricator.wikimedia.org/T331500 (10jbond)
[12:01:46] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] admin: dmartin now needs analytics-privatedata + krb [puppet] - 10https://gerrit.wikimedia.org/r/896046 (https://phabricator.wikimedia.org/T331500) (owner: 10MVernon)
[12:03:40] <wikibugs>	 (03CR) 10MVernon: [C: 03+2] admin: dmartin now needs analytics-privatedata + krb [puppet] - 10https://gerrit.wikimedia.org/r/896046 (https://phabricator.wikimedia.org/T331500) (owner: 10MVernon)
[12:04:49] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1] "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/896063 (https://phabricator.wikimedia.org/T326363) (owner: 10Clément Goubert)
[12:05:00] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40056/console" [puppet] - 10https://gerrit.wikimedia.org/r/894017 (https://phabricator.wikimedia.org/T331115) (owner: 10Btullis)
[12:05:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314 (T329260)', diff saved to https://phabricator.wikimedia.org/P45684 and previous config saved to /var/cache/conftool/dbconfig/20230309-120537-marostegui.json
[12:05:40] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2138.codfw.wmnet with reason: Maintenance
[12:05:43] <stashbot>	 T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260
[12:05:53] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2138.codfw.wmnet with reason: Maintenance
[12:05:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2138:3314 (T329260)', diff saved to https://phabricator.wikimedia.org/P45685 and previous config saved to /var/cache/conftool/dbconfig/20230309-120559-marostegui.json
[12:06:22] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for David Martin - https://phabricator.wikimedia.org/T331500 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon @DMartin-WMF all done.
[12:06:45] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40057/console" [puppet] - 10https://gerrit.wikimedia.org/r/894017 (https://phabricator.wikimedia.org/T331115) (owner: 10Btullis)
[12:08:05] <wikibugs>	 (03PS2) 10Clément Goubert: Assign mediawiki roles to mw2420-mw2451 [puppet] - 10https://gerrit.wikimedia.org/r/896063 (https://phabricator.wikimedia.org/T326363)
[12:08:19] <wikibugs>	 10SRE, 10serviceops, 10Abstract Wikipedia team (Phase λ – Launch), 10Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (10MatthewVernon)
[12:09:08] <wikibugs>	 (03PS1) 10Jbond: promethus: move expose ssl certs to prometheus::ops [puppet] - 10https://gerrit.wikimedia.org/r/896065
[12:09:35] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] promethus: move expose ssl certs to prometheus::ops [puppet] - 10https://gerrit.wikimedia.org/r/896065 (owner: 10Jbond)
[12:09:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P45686 and previous config saved to /var/cache/conftool/dbconfig/20230309-120951-marostegui.json
[12:13:45] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] cr-cloud: permit toolsdb return traffic to cloudcontrols (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/896051 (https://phabricator.wikimedia.org/T303663) (owner: 10Majavah)
[12:13:47] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] "pybal and the alerting system doesn't support a cluster without any administratively pooled server AFAIK so it won't be happy cause aqs@co" [puppet] - 10https://gerrit.wikimedia.org/r/894017 (https://phabricator.wikimedia.org/T331115) (owner: 10Btullis)
[12:15:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P45687 and previous config saved to /var/cache/conftool/dbconfig/20230309-121510-root.json
[12:17:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314 (T329260)', diff saved to https://phabricator.wikimedia.org/P45688 and previous config saved to /var/cache/conftool/dbconfig/20230309-121756-marostegui.json
[12:18:02] <stashbot>	 T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260
[12:19:49] <jinxer-wm>	 (ProbeDown) firing: (68) Service pki1001:443 has failed probes (http_aux_front_proxy_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:20:27] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "Thanks." [deployment-charts] - 10https://gerrit.wikimedia.org/r/895748 (owner: 10Alexandros Kosiaris)
[12:20:39] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: istio wikikube: Add the proper tolerations [deployment-charts] - 10https://gerrit.wikimedia.org/r/895748
[12:21:29] <wikibugs>	 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10MoritzMuehlenhoff)
[12:22:58] <moritzm>	 !log rebalancing ganeti eqiad/C after completion of bullseye updates T311687
[12:23:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:23:03] <stashbot>	 T311687: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687
[12:24:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P45689 and previous config saved to /var/cache/conftool/dbconfig/20230309-122458-marostegui.json
[12:27:35] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] istio wikikube: Add the proper tolerations [deployment-charts] - 10https://gerrit.wikimedia.org/r/895748 (owner: 10Alexandros Kosiaris)
[12:27:49] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] istio wikikube: Add the proper tolerations (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/895748 (owner: 10Alexandros Kosiaris)
[12:30:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P45690 and previous config saved to /var/cache/conftool/dbconfig/20230309-123015-root.json
[12:32:59] <wikibugs>	 (03Merged) 10jenkins-bot: istio wikikube: Add the proper tolerations [deployment-charts] - 10https://gerrit.wikimedia.org/r/895748 (owner: 10Alexandros Kosiaris)
[12:33:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314', diff saved to https://phabricator.wikimedia.org/P45691 and previous config saved to /var/cache/conftool/dbconfig/20230309-123303-marostegui.json
[12:40:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T329203)', diff saved to https://phabricator.wikimedia.org/P45692 and previous config saved to /var/cache/conftool/dbconfig/20230309-124004-marostegui.json
[12:40:06] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2177.codfw.wmnet with reason: Maintenance
[12:40:11] <stashbot>	 T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[12:40:19] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2177.codfw.wmnet with reason: Maintenance
[12:40:26] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2177 (T329203)', diff saved to https://phabricator.wikimedia.org/P45693 and previous config saved to /var/cache/conftool/dbconfig/20230309-124025-marostegui.json
[12:42:41] <wikibugs>	 (03PS1) 10Jbond: blackbox: allow sending raw bodies: [puppet] - 10https://gerrit.wikimedia.org/r/896082
[12:42:43] <wikibugs>	 (03PS1) 10Jbond: pki: use body_raw for check [puppet] - 10https://gerrit.wikimedia.org/r/896083
[12:43:50] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40058/console" [puppet] - 10https://gerrit.wikimedia.org/r/896083 (owner: 10Jbond)
[12:44:40] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] pki: use body_raw for check [puppet] - 10https://gerrit.wikimedia.org/r/896083 (owner: 10Jbond)
[12:46:20] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.dns.netbox
[12:47:04] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Add forward and reverse entries for aqs.svc.codfw.wmnet (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/894024 (https://phabricator.wikimedia.org/T331115) (owner: 10Btullis)
[12:47:06] <wikibugs>	 (03PS2) 10Jbond: blackbox: allow sending raw bodies: [puppet] - 10https://gerrit.wikimedia.org/r/896082
[12:48:01] <wikibugs>	 (03PS1) 10MarcoAurelio: eswikiversity: Enable SFS in enforce mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896085 (https://phabricator.wikimedia.org/T331182)
[12:48:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314', diff saved to https://phabricator.wikimedia.org/P45694 and previous config saved to /var/cache/conftool/dbconfig/20230309-124809-marostegui.json
[12:49:19] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/895844 (owner: 10Slyngshede)
[12:49:39] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40060/console" [puppet] - 10https://gerrit.wikimedia.org/r/896082 (owner: 10Jbond)
[12:49:50] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+2] C:idm::deployment ldap servers must be a list. [puppet] - 10https://gerrit.wikimedia.org/r/895844 (owner: 10Slyngshede)
[12:50:06] <wikibugs>	 (03PS3) 10Nicolas Fraison: hadoop::hdfs: remove nrpe check file age on FSImage [puppet] - 10https://gerrit.wikimedia.org/r/896057 (https://phabricator.wikimedia.org/T331310)
[12:50:08] <wikibugs>	 (03PS3) 10Nicolas Fraison: hadoop:hdfs: fully remove FSImage nrpe check file age alert [puppet] - 10https://gerrit.wikimedia.org/r/896058 (https://phabricator.wikimedia.org/T331310)
[12:51:12] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40061/console" [puppet] - 10https://gerrit.wikimedia.org/r/896083 (owner: 10Jbond)
[12:53:21] <logmsgbot>	 !log btullis@puppetmaster1001 conftool action : set/weight=10; selector: name=aqs2001.codfw.wmnet
[12:55:20] <logmsgbot>	 !log btullis@puppetmaster1001 conftool action : set/weight=10; selector: cluster=aqs,dc=codfw
[12:55:26] <wikibugs>	 (03PS3) 10Jbond: blackbox: allow sending raw bodies: [puppet] - 10https://gerrit.wikimedia.org/r/896082
[12:55:31] <wikibugs>	 (03PS2) 10Jbond: pki: use body_raw for check [puppet] - 10https://gerrit.wikimedia.org/r/896083
[12:56:28] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40062/console" [puppet] - 10https://gerrit.wikimedia.org/r/896082 (owner: 10Jbond)
[12:57:17] <logmsgbot>	 !log btullis@puppetmaster1001 conftool action : set/pooled=yes; selector: cluster=aqs,dc=codfw
[12:58:02] <wikibugs>	 (03CR) 10Jelto: "left some feedback in-line" [puppet] - 10https://gerrit.wikimedia.org/r/895240 (https://phabricator.wikimedia.org/T322369) (owner: 10EoghanGaffney)
[12:58:17] <wikibugs>	 (03CR) 10Nicolas Fraison: Update the spark-operator chart with consistent image versions (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/895842 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis)
[12:59:34] <wikibugs>	 (03CR) 10Btullis: Add an entry in the service catalog for the aqs service running in codfw (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/894017 (https://phabricator.wikimedia.org/T331115) (owner: 10Btullis)
[12:59:58] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] spark-operator: rely on exec entrypoint instead of shell one [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/896053 (https://phabricator.wikimedia.org/T318926) (owner: 10Nicolas Fraison)
[13:00:09] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Add an entry in the service catalog for the aqs service running in codfw [puppet] - 10https://gerrit.wikimedia.org/r/894017 (https://phabricator.wikimedia.org/T331115) (owner: 10Btullis)
[13:01:55] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] blackbox: allow sending raw bodies: [puppet] - 10https://gerrit.wikimedia.org/r/896082 (owner: 10Jbond)
[13:01:59] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] pki: use body_raw for check [puppet] - 10https://gerrit.wikimedia.org/r/896083 (owner: 10Jbond)
[13:03:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314 (T329260)', diff saved to https://phabricator.wikimedia.org/P45695 and previous config saved to /var/cache/conftool/dbconfig/20230309-130315-marostegui.json
[13:03:17] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2139.codfw.wmnet with reason: Maintenance
[13:03:18] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: btullis-T331115 - btullis@cumin1001"
[13:03:19] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2139.codfw.wmnet with reason: Maintenance
[13:03:21] <stashbot>	 T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260
[13:03:29] <stashbot>	 T331115: Finalize the multi-dc configuration of AQS (nodejs) in codfw - https://phabricator.wikimedia.org/T331115
[13:04:26] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: btullis-T331115 - btullis@cumin1001"
[13:04:26] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:09:49] <jinxer-wm>	 (ProbeDown) firing: (68) Service pki1001:443 has failed probes (http_aux_front_proxy_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:11:17] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2147.codfw.wmnet with reason: Maintenance
[13:11:30] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2147.codfw.wmnet with reason: Maintenance
[13:11:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2147 (T329260)', diff saved to https://phabricator.wikimedia.org/P45696 and previous config saved to /var/cache/conftool/dbconfig/20230309-131136-marostegui.json
[13:11:42] <stashbot>	 T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260
[13:12:24] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.12:7232]) https://wikitech.wikimedia.org/wiki/PyBal
[13:13:28] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.12:7232]) https://wikitech.wikimedia.org/wiki/PyBal
[13:14:00] <jbond>	 btullis: fyi ^^^
[13:14:16] <jbond>	 i think this relates to what vgutier.rez  mentioned
[13:14:34] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs2009 is CRITICAL: CRITICAL: 71 connections established with conf2005.codfw.wmnet:4001 (min=72) https://wikitech.wikimedia.org/wiki/PyBal
[13:14:35] <btullis>	 jbond: Thanks. Looking now.
[13:14:40] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs2010 is CRITICAL: CRITICAL: 89 connections established with conf2004.codfw.wmnet:4001 (min=90) https://wikitech.wikimedia.org/wiki/PyBal
[13:16:05] <vgutierrez>	 that's expected
[13:16:15] * vgutierrez taking care of it
[13:16:16] <btullis>	 Phew!
[13:16:18] <jinxer-wm>	 (ProbeDown) firing: Service aqs:7232 has failed probes (http_aqs_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#aqs:7232 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:16:45] <vgutierrez>	 !incidents
[13:16:45] <sirenbot>	 3468 (ACKED)  ProbeDown (10.2.1.12 ip4 aqs:7232 probes/service http_aqs_ip4 codfw)
[13:16:45] <sirenbot>	 3467 (RESOLVED)  Primary outbound port utilisation over 80%  (paged) global (cr2-codfw.wikimedia.org)
[13:16:45] <sirenbot>	 3466 (RESOLVED)  SessionStoreErrorRateHigh (eqiad)
[13:17:20] <vgutierrez>	 !log rolling restart of pybal in lvs2009 and lvs2010
[13:17:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:17:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:19:20] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[13:19:27] <wikibugs>	 (03PS1) 10Kosta Harlan: changeprop: Add rules for notificationKeepGoingJob [deployment-charts] - 10https://gerrit.wikimedia.org/r/896091 (https://phabricator.wikimedia.org/T331616)
[13:19:49] <jinxer-wm>	 (ProbeDown) firing: (69) Service pki1001:443 has failed probes (http_aux_front_proxy_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:19:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T329203)', diff saved to https://phabricator.wikimedia.org/P45697 and previous config saved to /var/cache/conftool/dbconfig/20230309-131951-marostegui.json
[13:19:57] <stashbot>	 T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[13:20:30] <wikibugs>	 (03CR) 10Kosta Harlan: changeprop: Add rules for notificationKeepGoingJob (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/896091 (https://phabricator.wikimedia.org/T331616) (owner: 10Kosta Harlan)
[13:20:32] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs2010 is OK: OK: 90 connections established with conf2004.codfw.wmnet:4001 (min=90) https://wikitech.wikimedia.org/wiki/PyBal
[13:21:18] <jinxer-wm>	 (ProbeDown) resolved: Service aqs:7232 has failed probes (http_aqs_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#aqs:7232 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:22:22] <wikibugs>	 (03CR) 10Nicolas Fraison: [V: 03+2 C: 03+2] spark-operator: rely on exec entrypoint instead of shell one [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/896053 (https://phabricator.wikimedia.org/T318926) (owner: 10Nicolas Fraison)
[13:22:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:23:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T329260)', diff saved to https://phabricator.wikimedia.org/P45698 and previous config saved to /var/cache/conftool/dbconfig/20230309-132331-marostegui.json
[13:23:37] <stashbot>	 T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260
[13:24:04] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[13:24:27] <wikibugs>	 (03PS2) 10Kosta Harlan: changeprop: Rules for notificationKeepGoingJob and notificationGetStartedJob [deployment-charts] - 10https://gerrit.wikimedia.org/r/896091 (https://phabricator.wikimedia.org/T331616)
[13:24:49] <jinxer-wm>	 (ProbeDown) resolved: (69) Service pki1001:443 has failed probes (http_aux_front_proxy_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:24:50] <wikibugs>	 (03PS3) 10Kosta Harlan: changeprop: Rules for notificationKeepGoingJob and notificationGetStartedJob [deployment-charts] - 10https://gerrit.wikimedia.org/r/896091 (https://phabricator.wikimedia.org/T331616)
[13:26:14] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs2009 is OK: OK: 72 connections established with conf2005.codfw.wmnet:4001 (min=72) https://wikitech.wikimedia.org/wiki/PyBal
[13:27:07] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db[2135,2160].codfw.wmnet,db[1117,1176,1183].eqiad.wmnet with reason: Topology changes
[13:27:22] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2135,2160].codfw.wmnet,db[1117,1176,1183].eqiad.wmnet with reason: Topology changes
[13:27:41] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Very nice! There are actually two places in our Puppet which use an unhashed lookup:" [puppet] - 10https://gerrit.wikimedia.org/r/895811 (owner: 10Jbond)
[13:27:58] <wikibugs>	 10SRE, 10DBA, 10Toolhub, 10Wikimedia-Mailing-lists, and 2 others: Switchover m5 master (db1183 -> db1176) - https://phabricator.wikimedia.org/T330847 (10Marostegui)
[13:28:08] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "PCC is also fine https://puppet-compiler.wmflabs.org/output/895811/40063/" [puppet] - 10https://gerrit.wikimedia.org/r/895811 (owner: 10Jbond)
[13:31:36] <icinga-wm>	 RECOVERY - Check systemd state on arclamp2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:31:50] <icinga-wm>	 RECOVERY - Check systemd state on arclamp1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:34:30] <moritzm>	 !log installing curl security updates
[13:34:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:34:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P45699 and previous config saved to /var/cache/conftool/dbconfig/20230309-133458-marostegui.json
[13:38:35] <wikibugs>	 (PS4) Btullis: Update the spark-operator chart with consistent image versions [deployment-charts] - https://gerrit.wikimedia.org/r/895842 (https://phabricator.wikimedia.org/T318926)
[13:38:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P45700 and previous config saved to /var/cache/conftool/dbconfig/20230309-133837-marostegui.json
[13:42:01] <moritzm>	 !log restarting FPM/Apache on mw canaries to pick up curl updates
[13:42:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:43:52] <icinga-wm>	 PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /robots.txt (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 503 (expecting: 200): / (spec from root) is CRITICAL: Test spec from root returned the unexpected status 503 (expecting: 200): /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wi
[13:45:44] <icinga-wm>	 RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[13:46:47] <wikibugs>	 (PS1) Btullis: Upgrade Airflon on an-launcher1002 to version 2.5.1 [puppet] - https://gerrit.wikimedia.org/r/896098 (https://phabricator.wikimedia.org/T326193)
[13:47:49] <wikibugs>	 (PS2) Btullis: Upgrade Airflow on an-launcher1002 to version 2.5.1 [puppet] - https://gerrit.wikimedia.org/r/896098 (https://phabricator.wikimedia.org/T326193)
[13:49:35] <wikibugs>	 (CR) Btullis: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40064/console" [puppet] - https://gerrit.wikimedia.org/r/896098 (https://phabricator.wikimedia.org/T326193) (owner: Btullis)
[13:49:41] <wikibugs>	 (PS1) Muehlenhoff: Add urldownloader100[34] to site.pp [puppet] - https://gerrit.wikimedia.org/r/896099 (https://phabricator.wikimedia.org/T329945)
[13:49:48] <jinxer-wm>	 (ProbeDown) firing: Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:50:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P45701 and previous config saved to /var/cache/conftool/dbconfig/20230309-135004-marostegui.json
[13:51:10] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:51:53] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Add urldownloader100[34] to site.pp [puppet] - https://gerrit.wikimedia.org/r/896099 (https://phabricator.wikimedia.org/T329945) (owner: Muehlenhoff)
[13:52:32] <wikibugs>	 (PS3) Btullis: Upgrade Airflow on an-launcher1002 to version 2.5.1 [puppet] - https://gerrit.wikimedia.org/r/896098 (https://phabricator.wikimedia.org/T326193)
[13:53:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P45702 and previous config saved to /var/cache/conftool/dbconfig/20230309-135343-marostegui.json
[13:53:51] <wikibugs>	 (CR) Btullis: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40065/console" [puppet] - https://gerrit.wikimedia.org/r/896098 (https://phabricator.wikimedia.org/T326193) (owner: Btullis)
[13:54:42] <wikibugs>	 (CR) Stevemunene: [C: +1] Upgrade Airflow on an-launcher1002 to version 2.5.1 [puppet] - https://gerrit.wikimedia.org/r/896098 (https://phabricator.wikimedia.org/T326193) (owner: Btullis)
[13:54:58] <wikibugs>	 (CR) Btullis: [V: +1 C: +2] Upgrade Airflow on an-launcher1002 to version 2.5.1 [puppet] - https://gerrit.wikimedia.org/r/896098 (https://phabricator.wikimedia.org/T326193) (owner: Btullis)
[13:57:41] <wikibugs>	 (PS1) FNegri: [toolsdb] Update config file but keep old one [puppet] - https://gerrit.wikimedia.org/r/896101
[13:58:01] <wikibugs>	 (PS2) FNegri: [toolsdb] Update config file but keep old one [puppet] - https://gerrit.wikimedia.org/r/896101 (https://phabricator.wikimedia.org/T329970)
[13:58:03] <wikibugs>	 (CR) CI reject: [V: -1] [toolsdb] Update config file but keep old one [puppet] - https://gerrit.wikimedia.org/r/896101 (https://phabricator.wikimedia.org/T329970) (owner: FNegri)
[13:58:22] <wikibugs>	 (CR) CI reject: [V: -1] [toolsdb] Update config file but keep old one [puppet] - https://gerrit.wikimedia.org/r/896101 (https://phabricator.wikimedia.org/T329970) (owner: FNegri)
[14:00:04] <logmsgbot>	 !log aqu@deploy2002 Started deploy [airflow-dags/analytics@9fba86b]: Upgrade to 2.5.1 from origin/T326194_airflow_deb_creation_with_gitlab_ci  [airflow-dags@9fba86b]
[14:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230309T1400)
[14:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230309T1400)
[14:00:05] <jouncebot>	 duesen: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:16] <wikibugs>	 (PS3) FNegri: [toolsdb] Update config file but keep old one [puppet] - https://gerrit.wikimedia.org/r/896101 (https://phabricator.wikimedia.org/T329970)
[14:00:17] <logmsgbot>	 !log aqu@deploy2002 Finished deploy [airflow-dags/analytics@9fba86b]: Upgrade to 2.5.1 from origin/T326194_airflow_deb_creation_with_gitlab_ci  [airflow-dags@9fba86b] (duration: 00m 13s)
[14:00:28] <TheresNoTime>	 I can deploy
[14:00:29] <Lucas_WMDE>	 I’m in a meeting, sorry
[14:00:31] <Lucas_WMDE>	 yay
[14:00:39] <wikibugs>	 (CR) CI reject: [V: -1] [toolsdb] Update config file but keep old one [puppet] - https://gerrit.wikimedia.org/r/896101 (https://phabricator.wikimedia.org/T329970) (owner: FNegri)
[14:01:21] <TheresNoTime>	 duesen: around? :)
[14:01:39] <wikibugs>	 (PS3) Samtar: Bump parsoid parser cache writes to 50%. [mediawiki-config] - https://gerrit.wikimedia.org/r/886905 (https://phabricator.wikimedia.org/T320534) (owner: Daniel Kinzler)
[14:02:17] <wikibugs>	 (PS4) FNegri: [toolsdb] Update config file but keep old one [puppet] - https://gerrit.wikimedia.org/r/896101 (https://phabricator.wikimedia.org/T329970)
[14:02:38] <wikibugs>	 (CR) CI reject: [V: -1] [toolsdb] Update config file but keep old one [puppet] - https://gerrit.wikimedia.org/r/896101 (https://phabricator.wikimedia.org/T329970) (owner: FNegri)
[14:03:48] * TheresNoTime will await duesen 
[14:04:19] <duesen>	 TheresNoTime: hey!
[14:04:25] <TheresNoTime>	 o/
[14:04:42] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/886905 (https://phabricator.wikimedia.org/T320534) (owner: Daniel Kinzler)
[14:04:46] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment for Sergio Gimeno - https://phabricator.wikimedia.org/T330070 (thcipriani) >>! In T330070#8667684, @MatthewVernon wrote: > @thcipriani can I ping you about this approval, please?  Yes, sorry for the delay :( — approved!
[14:05:04] <duesen>	 TheresNoTime: so... this is like the last couple of times. It just bumps a config variable, and the effect will become visible on grafana once it is hit by full traffic. Nothing to test.
[14:05:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T329203)', diff saved to https://phabricator.wikimedia.org/P45703 and previous config saved to /var/cache/conftool/dbconfig/20230309-140510-marostegui.json
[14:05:16] <stashbot>	 T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[14:05:17] <wikibugs>	 (PS5) FNegri: [toolsdb] Update config file but keep old one [puppet] - https://gerrit.wikimedia.org/r/896101 (https://phabricator.wikimedia.org/T329970)
[14:05:20] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment for Sergio Gimeno - https://phabricator.wikimedia.org/T330070 (SLyngshede-WMF)
[14:05:23] <TheresNoTime>	 duesen: ack, okay thank you, will just run it through :)
[14:05:25] <wikibugs>	 (PS4) Slyngshede: data.yaml add sgimeno to deployment group. [puppet] - https://gerrit.wikimedia.org/r/890797 (https://phabricator.wikimedia.org/T330070)
[14:05:27] <wikibugs>	 (Merged) jenkins-bot: Bump parsoid parser cache writes to 50%. [mediawiki-config] - https://gerrit.wikimedia.org/r/886905 (https://phabricator.wikimedia.org/T320534) (owner: Daniel Kinzler)
[14:05:41] <wikibugs>	 (CR) CI reject: [V: -1] [toolsdb] Update config file but keep old one [puppet] - https://gerrit.wikimedia.org/r/896101 (https://phabricator.wikimedia.org/T329970) (owner: FNegri)
[14:05:51] <logmsgbot>	 !log samtar@deploy2002 Started scap: Backport for [[gerrit:886905|Bump parsoid parser cache writes to 50%. (T320534)]]
[14:05:55] <duesen>	 TheresNoTime: i'll keep an eye on the dashboard
[14:05:56] <stashbot>	 T320534: Put Parsoid output into the ParserCache on every edit - https://phabricator.wikimedia.org/T320534
[14:06:30] <wikibugs>	 SRE-swift-storage, ConfirmEdit (CAPTCHA extension), Wikimedia-production-error: FileBackendError: Iterator page I/O error. - https://phabricator.wikimedia.org/T318941 (TheresNoTime) Seeing a slight uptick (again) with these, recent:  ==== Error ====  * mwversion: 1.40.0-wmf.25 * reqId: 65b5c08f-f0ab-...
[14:06:38] <wikibugs>	 (CR) Slyngshede: [C: +2] data.yaml add sgimeno to deployment group. [puppet] - https://gerrit.wikimedia.org/r/890797 (https://phabricator.wikimedia.org/T330070) (owner: Slyngshede)
[14:07:33] <logmsgbot>	 !log samtar@deploy2002 daniel and samtar: Backport for [[gerrit:886905|Bump parsoid parser cache writes to 50%. (T320534)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet
[14:07:39] <TheresNoTime>	 syncing
[14:07:51] <wikibugs>	 (PS6) FNegri: [toolsdb] Update config file but keep old one [puppet] - https://gerrit.wikimedia.org/r/896101 (https://phabricator.wikimedia.org/T329970)
[14:07:53] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/881422 (https://phabricator.wikimedia.org/T292942) (owner: Muehlenhoff)
[14:08:02] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Testing Out Hard Drive on Swift Server - https://phabricator.wikimedia.org/T329305 (MatthewVernon) Yes, please. I've unmounted a drive in ms-be1066 and turned on the locator light `sudo megacli -PDLocate -PhysDrv [32:15] -a0`  So please go ahead.
[14:08:09] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment for Sergio Gimeno - https://phabricator.wikimedia.org/T330070 (SLyngshede-WMF) Open→Resolved a:SLyngshede-WMF
[14:08:36] <Emperor>	 !log testing disk-swap in ms-be1066 T329305
[14:08:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:41] <stashbot>	 T329305: Testing Out Hard Drive on Swift Server - https://phabricator.wikimedia.org/T329305
[14:08:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T329260)', diff saved to https://phabricator.wikimedia.org/P45704 and previous config saved to /var/cache/conftool/dbconfig/20230309-140850-marostegui.json
[14:08:53] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2155.codfw.wmnet with reason: Maintenance
[14:08:55] <stashbot>	 T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260
[14:09:06] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2155.codfw.wmnet with reason: Maintenance
[14:09:07] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance
[14:09:10] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance
[14:09:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2155 (T329260)', diff saved to https://phabricator.wikimedia.org/P45705 and previous config saved to /var/cache/conftool/dbconfig/20230309-140915-marostegui.json
[14:10:33] <wikibugs>	 (CR) FNegri: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/896101 (https://phabricator.wikimedia.org/T329970) (owner: FNegri)
[14:11:23] <logmsgbot>	 !log jgiannelos@deploy2002 Started deploy [restbase/deploy@f774711]: (no justification provided)
[14:12:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:13:20] <logmsgbot>	 !log samtar@deploy2002 Finished scap: Backport for [[gerrit:886905|Bump parsoid parser cache writes to 50%. (T320534)]] (duration: 07m 28s)
[14:13:25] <stashbot>	 T320534: Put Parsoid output into the ParserCache on every edit - https://phabricator.wikimedia.org/T320534
[14:13:48] <TheresNoTime>	 duesen: that's now live
[14:14:34] <TheresNoTime>	 (out of curiosity, which dashboard will reflect these changes?)
[14:15:34] <wikibugs>	 SRE, Infrastructure-Foundations, netops, cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (cmooney) >>! In T327919#8679398, @aborrero wrote: > In the past we had problems with DHCP forwarding be...
[14:17:05] <duesen>	 TheresNoTime: https://grafana-rw.wikimedia.org/d/OxxOv5K4k/ve-backend-dashboard?forceLogin&from=now-1h&orgId=1&refresh=30s&to=now&viewPanel=11
[14:17:30] <duesen>	 TheresNoTime: the green area and the grey should eventually be roughly the same size
[14:17:52] <duesen>	 TheresNoTime: the split was 80/20 before, should be 50/50 now. Looks like it's getting there.
[14:17:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:18:44] <TheresNoTime>	 nice :D
[14:19:12] * TheresNoTime will be around for the next 30m if there's any other patches o/
[14:19:25] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Testing Out Hard Drive on Swift Server - https://phabricator.wikimedia.org/T329305 (MatthewVernon) Something has gone a bit awry, the kernel reports problems with two other drives instead: ` Mar  9 14:13:57 ms-be1066 kernel: [11683056.185701] sd 0:2:4:0: [sdf] tag#699 FAILED R...
[14:19:51] <duesen>	 TheresNoTime: thank you
[14:19:53] <wikibugs>	 (PS7) FNegri: [toolsdb] Update config file but keep old one [puppet] - https://gerrit.wikimedia.org/r/896101 (https://phabricator.wikimedia.org/T329970)
[14:20:24] <wikibugs>	 (CR) FNegri: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/896101 (https://phabricator.wikimedia.org/T329970) (owner: FNegri)
[14:22:25] <wikibugs>	 SRE, ops-codfw, Infrastructure-Foundations, netops: Run 2x1G links from asw-b1-codfw to cloudsw1-b1-codfw - https://phabricator.wikimedia.org/T331470 (Jhancock.wm) @cmooney I got these repatched as depicted in the links. Thanks for waiting. Please let me know if you need anything else!
[14:22:41] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Extend dumps alias [puppet] - https://gerrit.wikimedia.org/r/895751 (owner: Muehlenhoff)
[14:23:32] <wikibugs>	 (CR) Muehlenhoff: [C: +2] slapd: Add support to configure MDB storage backend [puppet] - https://gerrit.wikimedia.org/r/881422 (https://phabricator.wikimedia.org/T292942) (owner: Muehlenhoff)
[14:29:49] <wikibugs>	 (PS5) Btullis: Update the spark-operator chart with consistent image details [deployment-charts] - https://gerrit.wikimedia.org/r/895842 (https://phabricator.wikimedia.org/T318926)
[14:30:26] <logmsgbot>	 !log jgiannelos@deploy2002 Finished deploy [restbase/deploy@f774711]: (no justification provided) (duration: 19m 03s)
[14:30:32] <wikibugs>	 (PS8) FNegri: [toolsdb] Update config file but keep old one [puppet] - https://gerrit.wikimedia.org/r/896101 (https://phabricator.wikimedia.org/T329970)
[14:30:43] <wikibugs>	 SRE, Infrastructure-Foundations, netops, cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (cmooney)
[14:30:55] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Testing Out Hard Drive on Swift Server - https://phabricator.wikimedia.org/T329305 (MatthewVernon) Looking at these drives -  ` sdz is bus info: scsi@0:2.25.0 Target Id: 25 is Enclosure Device ID: 32 Slot Number: 23 `  ` sdf is still absent but scsi@0:2.17.0 is missing Target...
[14:30:59] <wikibugs>	 SRE, ops-codfw, Infrastructure-Foundations, netops: Run 2x1G links from asw-b1-codfw to cloudsw1-b1-codfw - https://phabricator.wikimedia.org/T331470 (cmooney) Open→Resolved That's great Jenn thanks!  All looking good and working now :) ` cmooney@cloudsw1-b1-codfw> show interfaces descrip...
[14:31:09] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2105.codfw.wmnet with reason: Maintenance
[14:31:23] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2105.codfw.wmnet with reason: Maintenance
[14:32:38] <wikibugs>	 (CR) Jelto: [C: +1] "lgtm. I resolved on one of my in-line comments after checking the migration of files to the config modules should be noop, because this mo" [puppet] - https://gerrit.wikimedia.org/r/895240 (https://phabricator.wikimedia.org/T322369) (owner: EoghanGaffney)
[14:33:03] <wikibugs>	 (CR) FNegri: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/896101 (https://phabricator.wikimedia.org/T329970) (owner: FNegri)
[14:33:33] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Testing Out Hard Drive on Swift Server - https://phabricator.wikimedia.org/T329305 (MatthewVernon) Target Id 4 also missing
[14:34:11] <wikibugs>	 (CR) Nicolas Fraison: Configure the new ceph servers with mon and mgr daemons (1 comment) [puppet] - https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) (owner: Btullis)
[14:34:19] <moritzm>	 !log installing apr security updates
[14:34:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:34:34] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Testing Out Hard Drive on Swift Server - https://phabricator.wikimedia.org/T329305 (Jclark-ctr) slot 2 is right by the handle. possibly
[14:35:59] <wikibugs>	 (CR) Nicolas Fraison: Configure the new ceph servers with mon and mgr daemons (2 comments) [puppet] - https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) (owner: Btullis)
[14:36:06] <wikibugs>	 (PS1) Daniel Kinzler: Make VE on officewiki use Parsoid directly [mediawiki-config] - https://gerrit.wikimedia.org/r/896104
[14:39:40] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Testing Out Hard Drive on Swift Server - https://phabricator.wikimedia.org/T329305 (Jclark-ctr) Replaced drive slot 15 with test drive
[14:44:15] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Testing Out Hard Drive on Swift Server - https://phabricator.wikimedia.org/T329305 (MatthewVernon) Can you check the drives in slots 23 and 2 are seated proper please? the kernel still can't see them.
[14:48:57] <wikibugs>	 (CR) Tacsipacsi: [C: +1] Drop unused FlaggedRevs threshold level names (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/790707 (https://phabricator.wikimedia.org/T277883) (owner: Awight)
[14:51:20] <zabe>	 jouncebot: nowandnext
[14:51:20] <jouncebot>	 For the next 0 hour(s) and 8 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230309T1400)
[14:51:20] <jouncebot>	 For the next 0 hour(s) and 8 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230309T1400)
[14:51:20] <jouncebot>	 In 2 hour(s) and 8 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230309T1700)
[14:52:10] <wikibugs>	 (PS6) Zabe: Drop unused FlaggedRevs threshold level names [mediawiki-config] - https://gerrit.wikimedia.org/r/790707 (https://phabricator.wikimedia.org/T277883) (owner: Awight)
[14:52:12] <wikibugs>	 (CR) Zabe: [C: +2] Drop unused FlaggedRevs threshold level names [mediawiki-config] - https://gerrit.wikimedia.org/r/790707 (https://phabricator.wikimedia.org/T277883) (owner: Awight)
[14:52:21] <wikibugs>	 (PS38) David Caro: Modify maintain-dbusers.py to call the rest-api service [puppet] - https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[14:52:23] <wikibugs>	 (CR) David Caro: Modify maintain-dbusers.py to call the rest-api service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[14:52:55] <wikibugs>	 (Merged) jenkins-bot: Drop unused FlaggedRevs threshold level names [mediawiki-config] - https://gerrit.wikimedia.org/r/790707 (https://phabricator.wikimedia.org/T277883) (owner: Awight)
[14:53:30] <wikibugs>	 (PS39) David Caro: Modify maintain-dbusers.py to call the rest-api service [puppet] - https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[14:54:02] <logmsgbot>	 !log zabe@deploy2002 Started scap: Backport for [[gerrit:790707|Drop unused FlaggedRevs threshold level names (T277883)]]
[14:54:08] <stashbot>	 T277883: Drop all low-use and unused features of FlaggedRevs to make it more maintainable - https://phabricator.wikimedia.org/T277883
[14:54:16] <wikibugs>	 (PS6) David Caro: maintain-dbusers: add nicer logging with dry run prefix [puppet] - https://gerrit.wikimedia.org/r/895756 (https://phabricator.wikimedia.org/T303663)
[14:54:27] <wikibugs>	 (PS2) David Caro: maintain-dbusers: skip tool accounts that are not ready [puppet] - https://gerrit.wikimedia.org/r/895838 (https://phabricator.wikimedia.org/T303663)
[14:55:03] <wikibugs>	 (CR) Andrew Bogott: [C: +1] [toolsdb] Update config file but keep old one [puppet] - https://gerrit.wikimedia.org/r/896101 (https://phabricator.wikimedia.org/T329970) (owner: FNegri)
[14:55:24] <wikibugs>	 (CR) CI reject: [V: -1] Modify maintain-dbusers.py to call the rest-api service [puppet] - https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[14:55:41] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/thumbor: apply
[14:55:52] <logmsgbot>	 !log zabe@deploy2002 awight and zabe: Backport for [[gerrit:790707|Drop unused FlaggedRevs threshold level names (T277883)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet
[14:56:00] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/thumbor: apply
[14:56:17] <wikibugs>	 (CR) CI reject: [V: -1] maintain-dbusers: add nicer logging with dry run prefix [puppet] - https://gerrit.wikimedia.org/r/895756 (https://phabricator.wikimedia.org/T303663) (owner: David Caro)
[14:57:04] <wikibugs>	 (CR) CI reject: [V: -1] maintain-dbusers: skip tool accounts that are not ready [puppet] - https://gerrit.wikimedia.org/r/895838 (https://phabricator.wikimedia.org/T303663) (owner: David Caro)
[14:58:01] <wikibugs>	 (CR) BBlack: [C: +1] dns::auth: deprecate role and update site.pp [puppet] - https://gerrit.wikimedia.org/r/895894 (https://phabricator.wikimedia.org/T330670) (owner: Ssingh)
[14:58:26] <wikibugs>	 (PS1) Slyngshede: R:idp_test create development service [puppet] - https://gerrit.wikimedia.org/r/896109
[15:00:05] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/thumbor: apply
[15:00:09] <wikibugs>	 (PS40) David Caro: Modify maintain-dbusers.py to call the rest-api service [puppet] - https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[15:00:12] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/thumbor: apply
[15:00:56] <wikibugs>	 (CR) Tacsipacsi: Drop unused FlaggedRevs threshold level names (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/790707 (https://phabricator.wikimedia.org/T277883) (owner: Awight)
[15:01:18] <wikibugs>	 (CR) Slyngshede: "Do you see any security implications of having a service that allows callback to be directed to localhost? It would be really helpful to j" [puppet] - https://gerrit.wikimedia.org/r/896109 (owner: Slyngshede)
[15:01:40] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/thumbor: apply
[15:01:43] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/thumbor: apply
[15:02:00] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db[2135,2160].codfw.wmnet,db[1117,1176,1183].eqiad.wmnet with reason: m5 master switch T330847
[15:02:04] <wikibugs>	 (CR) CI reject: [V: -1] Modify maintain-dbusers.py to call the rest-api service [puppet] - https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[15:02:06] <stashbot>	 T330847: Switchover m5 master (db1183 -> db1176) - https://phabricator.wikimedia.org/T330847
[15:02:16] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db[2135,2160].codfw.wmnet,db[1117,1176,1183].eqiad.wmnet with reason: m5 master switch T330847
[15:02:47] <wikibugs>	 (PS1) Muehlenhoff: slapd: correct module loading [puppet] - https://gerrit.wikimedia.org/r/896110 (https://phabricator.wikimedia.org/T292942)
[15:02:51] <wikibugs>	 SRE, DBA, Toolhub, Wikimedia-Mailing-lists, and 2 others: Switchover m5 master (db1183 -> db1176) - https://phabricator.wikimedia.org/T330847 (Marostegui)
[15:03:10] <jinxer-wm>	 (ThanosQueryHttpRequestQueryRangeErrorRateHigh) firing: Thanos Query is failing to handle requests. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryHttpRequestQueryRangeErrorRateHigh
[15:03:38] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Promote db1176 to m5 master [puppet] - https://gerrit.wikimedia.org/r/895910 (https://phabricator.wikimedia.org/T330847) (owner: Marostegui)
[15:03:42] <wikibugs>	 (PS2) Marostegui: mariadb: Promote db1176 to m5 master [puppet] - https://gerrit.wikimedia.org/r/895910 (https://phabricator.wikimedia.org/T330847)
[15:04:08] <TheresNoTime>	 !log close UTC afternoon backport window
[15:04:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:04:14] <wikibugs>	 SRE, DBA, Toolhub, Wikimedia-Mailing-lists, and 2 others: Switchover m5 master (db1183 -> db1176) - https://phabricator.wikimedia.org/T330847 (Marostegui)
[15:04:43] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/896110 (https://phabricator.wikimedia.org/T292942) (owner: Muehlenhoff)
[15:04:45] <wikibugs>	 SRE, ops-codfw, Data-Persistence (work done), decommission-hardware: decommission db2093.codfw.wmnet - https://phabricator.wikimedia.org/T330827 (Jhancock.wm) Open→Resolved
[15:04:51] <logmsgbot>	 !log zabe@deploy2002 Finished scap: Backport for [[gerrit:790707|Drop unused FlaggedRevs threshold level names (T277883)]] (duration: 10m 48s)
[15:04:55] <stashbot>	 T277883: Drop all low-use and unused features of FlaggedRevs to make it more maintainable - https://phabricator.wikimedia.org/T277883
[15:05:10] <wikibugs>	 (CR) Herron: [C: +1] centrallog: Add centrallog1002 as the kafkatee active host [puppet] - https://gerrit.wikimedia.org/r/895902 (https://phabricator.wikimedia.org/T328803) (owner: Andrea Denisse)
[15:05:33] <wikibugs>	 SRE, DBA, Toolhub, Wikimedia-Mailing-lists, and 2 others: Switchover m5 master (db1183 -> db1176) - https://phabricator.wikimedia.org/T330847 (Marostegui)
[15:06:53] <brett>	 !log Disable puppet on R:acme_chief::cert for acmechief maintenance - T321309
[15:06:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:06:57] <stashbot>	 T321309: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309
[15:07:25] <wikibugs>	 (CR) Herron: [C: +1] "LGTM pending fix for commit msg typo flagged by filippo" [puppet] - https://gerrit.wikimedia.org/r/895898 (https://phabricator.wikimedia.org/T328803) (owner: Andrea Denisse)
[15:07:54] <wikibugs>	 (CR) Herron: [C: +1] rsyslog: Remove centrallog1001 as TLS rsyslog destination [puppet] - https://gerrit.wikimedia.org/r/890884 (https://phabricator.wikimedia.org/T328803) (owner: Andrea Denisse)
[15:08:10] <jinxer-wm>	 (ThanosQueryHttpRequestQueryRangeErrorRateHigh) resolved: Thanos Query is failing to handle requests. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryHttpRequestQueryRangeErrorRateHigh
[15:08:17] <wikibugs>	 (CR) Vgutierrez: [C: +1] acmechief: Set acmechief2001 as active [puppet] - https://gerrit.wikimedia.org/r/895860 (https://phabricator.wikimedia.org/T321309) (owner: BCornwall)
[15:08:27] <wikibugs>	 (CR) BCornwall: [C: +2] acmechief: Set acmechief2001 as active [puppet] - https://gerrit.wikimedia.org/r/895860 (https://phabricator.wikimedia.org/T321309) (owner: BCornwall)
[15:09:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T329260)', diff saved to https://phabricator.wikimedia.org/P45706 and previous config saved to /var/cache/conftool/dbconfig/20230309-150940-marostegui.json
[15:09:46] <stashbot>	 T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260
[15:10:02] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/thumbor: apply
[15:10:04] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/thumbor: apply
[15:10:08] <icinga-wm>	 PROBLEM - Disk space on urldownloader2001 is CRITICAL: DISK CRITICAL - free space: / 332 MB (3% inode=81%): /tmp 332 MB (3% inode=81%): /var/tmp 332 MB (3% inode=81%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=urldownloader2001&var-datasource=codfw+prometheus/ops
[15:10:36] <wikibugs>	 (03PS1) 10JMeybohm: cert-manager: Enable stable certificate request names in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/896111 (https://phabricator.wikimedia.org/T304092)
[15:10:41] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2163.codfw.wmnet with reason: Maintenance
[15:10:54] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2163.codfw.wmnet with reason: Maintenance
[15:10:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:11:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2163 (T329203)', diff saved to https://phabricator.wikimedia.org/P45707 and previous config saved to /var/cache/conftool/dbconfig/20230309-151100-marostegui.json
[15:11:05] <stashbot>	 T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[15:11:35] <wikibugs>	 (03PS2) 10JMeybohm: admin_ng: Add default-network-policy globally [deployment-charts] - 10https://gerrit.wikimedia.org/r/893018 (https://phabricator.wikimedia.org/T275035)
[15:11:40] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/thumbor: apply
[15:11:43] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/thumbor: apply
[15:12:03] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[15:13:13] <wikibugs>	 (03CR) 10FNegri: [C: 03+2] [toolsdb] Update config file but keep old one [puppet] - 10https://gerrit.wikimedia.org/r/896101 (https://phabricator.wikimedia.org/T329970) (owner: 10FNegri)
[15:13:53] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: Testing Out Hard Drive on Swift Server - https://phabricator.wikimedia.org/T329305 (10MatthewVernon) [after a reboot the drive in slot 2 was in a "Foreign" state; clearing that made it possible to reintroduce it with `sudo megacli -CfgEachDskRaid0 WB RA Direct CachedBadBBU -a0`...
[15:13:57] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/thumbor: apply
[15:14:03] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/thumbor: apply
[15:14:04] <icinga-wm>	 PROBLEM - Check systemd state on acmechief1001 is CRITICAL: CRITICAL - degraded: The following units failed: reload-acme-chief-backend.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:14:34] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS records for codfw cr links to cloudsw-b1-codfw. - cmooney@cumin1001"
[15:14:47] <vgutierrez>	 acmechief1001 alert is expected
[15:15:04] <moritzm>	 !log installing PHP 7.3 security updates (as shipped in Debian)
[15:15:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:15:41] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS records for codfw cr links to cloudsw-b1-codfw. - cmooney@cumin1001"
[15:15:41] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:15:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:16:31] <wikibugs>	 10SRE, 10DBA, 10Toolhub, 10Wikimedia-Mailing-lists, 10cloud-services-team: Switchover m5 master (db1183 -> db1176) - https://phabricator.wikimedia.org/T330847 (10Marostegui)
[15:16:55] <wikibugs>	 10SRE, 10DBA, 10Toolhub, 10Wikimedia-Mailing-lists, 10cloud-services-team: Switchover m5 master (db1183 -> db1176) - https://phabricator.wikimedia.org/T330847 (10Marostegui) All the pre-failover steps are done. Waiting for 16:00 UTC to perform the actual switch.
[15:17:51] <wikibugs>	 (03PS1) 10Muehlenhoff: Add Cumin aliases for IDM [puppet] - 10https://gerrit.wikimedia.org/r/896112 (https://phabricator.wikimedia.org/T320797)
[15:19:03] <wikibugs>	 (03PS11) 10Bking: rdf-streaming-updater: add a test job using the k8s operator... [deployment-charts] - 10https://gerrit.wikimedia.org/r/886005 (https://phabricator.wikimedia.org/T328675) (owner: 10DCausse)
[15:19:38] <wikibugs>	 (03PS1) 10BCornwall: hieradata/common: acmechief_host as acmechief2001 [puppet] - 10https://gerrit.wikimedia.org/r/896114 (https://phabricator.wikimedia.org/T321309)
[15:20:31] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] hieradata/common: acmechief_host as acmechief2001 [puppet] - 10https://gerrit.wikimedia.org/r/896114 (https://phabricator.wikimedia.org/T321309) (owner: 10BCornwall)
[15:20:37] <wikibugs>	 (03CR) 10BCornwall: [C: 03+2] hieradata/common: acmechief_host as acmechief2001 [puppet] - 10https://gerrit.wikimedia.org/r/896114 (https://phabricator.wikimedia.org/T321309) (owner: 10BCornwall)
[15:21:34] <wikibugs>	 (03PS41) 10David Caro: Modify maintain-dbusers.py to call the rest-api service [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe)
[15:21:45] <wikibugs>	 (03CR) 10DCausse: [C: 03+1] rdf-streaming-updater: add a test job using the k8s operator... [deployment-charts] - 10https://gerrit.wikimedia.org/r/886005 (https://phabricator.wikimedia.org/T328675) (owner: 10DCausse)
[15:23:22] <wikibugs>	 (03PS1) 10David Caro: replica_cnf: return skip if the account already exists [puppet] - 10https://gerrit.wikimedia.org/r/896115
[15:23:28] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Modify maintain-dbusers.py to call the rest-api service [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe)
[15:24:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P45709 and previous config saved to /var/cache/conftool/dbconfig/20230309-152447-marostegui.json
[15:25:35] <wikibugs>	 (03CR) 10JMeybohm: Exclude traindev from tests (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/888227 (owner: 10Clément Goubert)
[15:26:14] <wikibugs>	 (03PS1) 10Nicolas Fraison: osd: Add osd on new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/896116
[15:26:16] <wikibugs>	 (03PS1) 10Nicolas Fraison: osd: create osd [puppet] - 10https://gerrit.wikimedia.org/r/896117
[15:26:28] <wikibugs>	 (03CR) 10Nicolas Fraison: [C: 04-2] "WIP" [puppet] - 10https://gerrit.wikimedia.org/r/896116 (owner: 10Nicolas Fraison)
[15:26:32] <wikibugs>	 (03CR) 10Nicolas Fraison: [C: 04-2] "WIP" [puppet] - 10https://gerrit.wikimedia.org/r/896117 (owner: 10Nicolas Fraison)
[15:26:38] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] osd: Add osd on new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/896116 (owner: 10Nicolas Fraison)
[15:26:45] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] osd: create osd [puppet] - 10https://gerrit.wikimedia.org/r/896117 (owner: 10Nicolas Fraison)
[15:26:59] <wikibugs>	 (03PS1) 10Zabe: switch noc.wikimedia.org from eqiad to codfw [dns] - 10https://gerrit.wikimedia.org/r/896118 (https://phabricator.wikimedia.org/T331634)
[15:27:12] <wikibugs>	 (03CR) 10Subramanya Sastry: [C: 04-1] "Let us first debug the etag breakage before we make this change. We don't want this to hide a bug only to resurface when we disable RESTBa" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896104 (owner: 10Daniel Kinzler)
[15:27:51] <brett>	 !log Enable puppet on R:acme_chief::cert - T321309
[15:27:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:27:56] <stashbot>	 T321309: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309
[15:28:51] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: Testing Out Hard Drive on Swift Server - https://phabricator.wikimedia.org/T329305 (10MatthewVernon) The swapped-in drive seems OK initially, I'll get swift to start using it shortly.
[15:29:00] <wikibugs>	 (03CR) 10Bking: [C: 03+2] rdf-streaming-updater: add a test job using the k8s operator... [deployment-charts] - 10https://gerrit.wikimedia.org/r/886005 (https://phabricator.wikimedia.org/T328675) (owner: 10DCausse)
[15:29:05] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.ganeti.reimage for host acmechief1001.eqiad.wmnet with OS bullseye
[15:29:15] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by brett@cumin2002 for host acmechief1001.eqiad.wmnet with OS bullseye
[15:30:11] <icinga-wm>	 RECOVERY - Disk space on urldownloader2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=urldownloader2001&var-datasource=codfw+prometheus/ops
[15:30:48] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] "LGTM.  Rule makes sense." [puppet] - 10https://gerrit.wikimedia.org/r/896052 (https://phabricator.wikimedia.org/T272585) (owner: 10Arturo Borrero Gonzalez)
[15:30:57] <wikibugs>	 (03CR) 10JMeybohm: [C: 04-1] "Do you mind removing the k8s version conditionals again? All clusters are on 1.23 and as of I77657a2674a4546aa5088660745f09eedd5d2201" [deployment-charts] - 10https://gerrit.wikimedia.org/r/887945 (https://phabricator.wikimedia.org/T306649) (owner: 10Ayounsi)
[15:31:01] <wikibugs>	 (03PS2) 10Nicolas Fraison: osd: Add osd on new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151)
[15:31:03] <wikibugs>	 (03PS2) 10Nicolas Fraison: osd: create osd [puppet] - 10https://gerrit.wikimedia.org/r/896117 (https://phabricator.wikimedia.org/T330151)
[15:31:33] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] osd: create osd [puppet] - 10https://gerrit.wikimedia.org/r/896117 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison)
[15:32:36] <wikibugs>	 (03PS32) 10Nicolas Fraison: Configure the new ceph servers with mon and mgr daemons [puppet] - 10https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) (owner: 10Btullis)
[15:32:38] <wikibugs>	 (03PS3) 10Nicolas Fraison: osd: Add osd on new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151)
[15:32:40] <wikibugs>	 (03PS3) 10Nicolas Fraison: osd: create osd [puppet] - 10https://gerrit.wikimedia.org/r/896117 (https://phabricator.wikimedia.org/T330151)
[15:32:42] <wikibugs>	 (03PS42) 10David Caro: Modify maintain-dbusers.py to call the rest-api service [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe)
[15:33:06] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] osd: Add osd on new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison)
[15:33:13] <wikibugs>	 (03CR) 10Nicolas Fraison: Configure the new ceph servers with mon and mgr daemons (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) (owner: 10Btullis)
[15:33:19] <wikibugs>	 (03CR) 10Nicolas Fraison: Configure the new ceph servers with mon and mgr daemons (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/887419 (https://phabricator.wikimedia.org/T328123) (owner: 10Btullis)
[15:33:22] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] osd: create osd [puppet] - 10https://gerrit.wikimedia.org/r/896117 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison)
[15:34:36] <wikibugs>	 (03Merged) 10jenkins-bot: rdf-streaming-updater: add a test job using the k8s operator... [deployment-charts] - 10https://gerrit.wikimedia.org/r/886005 (https://phabricator.wikimedia.org/T328675) (owner: 10DCausse)
[15:34:39] <icinga-wm>	 PROBLEM - Check systemd state on acmechief2001 is CRITICAL: CRITICAL - degraded: The following units failed: acme-chief-certs-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:35:38] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] osd: Add osd on new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison)
[15:35:52] <vgutierrez>	 ^^ expected while acmechief1001 is being reimaged
[15:36:52] <wikibugs>	 (03CR) 10JMeybohm: "I'd say +1 but this needs rebase after the 1.23 upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/889069 (https://phabricator.wikimedia.org/T306649) (owner: 10Ayounsi)
[15:36:56] <wikibugs>	 (03PS2) 10Andrea Denisse: centrallog: Remove centrallog1001 from the kafka-jumbo allow list [puppet] - 10https://gerrit.wikimedia.org/r/895898 (https://phabricator.wikimedia.org/T328803)
[15:37:24] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10cmooney) Some updates on the physicals for the new cloudsw.  The links to core routers are now up and c...
[15:39:07] <marostegui>	 In 20 minutes I am switching over m5 db master, which will affect toolhub, mailman and some other WMCS related databases. Impact: RO for around 1 minute, reads unaffected https://phabricator.wikimedia.org/T330847
[15:39:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P45710 and previous config saved to /var/cache/conftool/dbconfig/20230309-153953-marostegui.json
[15:40:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2163 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P45711 and previous config saved to /var/cache/conftool/dbconfig/20230309-154053-root.json
[15:44:41] <wikibugs>	 (03CR) 10Nray: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893542 (https://phabricator.wikimedia.org/T326829) (owner: 10Nray)
[15:44:48] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] replica_cnf: return skip if the account already exists [puppet] - 10https://gerrit.wikimedia.org/r/896115 (owner: 10David Caro)
[15:44:58] <wikibugs>	 (03CR) 10Herron: "thanks! the updates lgtm in general, please see comments inline" [puppet] - 10https://gerrit.wikimedia.org/r/894646 (owner: 10Jbond)
[15:46:15] <wikibugs>	 (03CR) 10BCornwall: codesearch: Change systemd Requires= to BindsTo= (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/895884 (https://phabricator.wikimedia.org/T284555) (owner: 10BCornwall)
[15:54:04] <wikibugs>	 10SRE, 10Gerrit, 10Infrastructure-Foundations, 10Mail, 10Wikimedia-Mailing-lists: reviewer-bot is not working - https://phabricator.wikimedia.org/T331626 (10hashar) I am guessing it is an issue with Mailman. https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3 shows a large queue **since March 7th 14:12**:...
[15:55:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T329260)', diff saved to https://phabricator.wikimedia.org/P45712 and previous config saved to /var/cache/conftool/dbconfig/20230309-155459-marostegui.json
[15:55:02] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2172.codfw.wmnet with reason: Maintenance
[15:55:06] <stashbot>	 T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260
[15:55:15] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2172.codfw.wmnet with reason: Maintenance
[15:55:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2172 (T329260)', diff saved to https://phabricator.wikimedia.org/P45713 and previous config saved to /var/cache/conftool/dbconfig/20230309-155520-marostegui.json
[15:55:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2163 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P45714 and previous config saved to /var/cache/conftool/dbconfig/20230309-155558-root.json
[15:56:18] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: Add check_dns_state to service.Service (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/894655 (owner: 10Giuseppe Lavagetto)
[15:56:25] <wikibugs>	 10SRE, 10ops-eqsin, 10ops-ulsfo, 10DC-Ops: eqsin & ulsfo: new R450s drawing far more power than R440s (power over contracted caps in both sites) - https://phabricator.wikimedia.org/T328957 (10RobH) We chatted about this during the last knams sync up call, as our racks there have a higher cap due to this....
[15:57:08] <wikibugs>	 (03PS7) 10Giuseppe Lavagetto: Add check_dns_state to service.Service [software/spicerack] - 10https://gerrit.wikimedia.org/r/894655
[15:57:20] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops: Netbox Juniper report - https://phabricator.wikimedia.org/T306238 (10jbond)
[15:57:30] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10CAS-SSO: Upgrade IDPs to CAS 6.6/Bullseye and enable webauthn - https://phabricator.wikimedia.org/T305518 (10jbond)
[15:57:36] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10CAS-SSO: Enable OIDC in CAS - https://phabricator.wikimedia.org/T311999 (10jbond) 05Open→03Resolved a:03jbond
[15:58:01] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+1] gitlab_runner: add optional docker registry proxy to runners (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/894100 (https://phabricator.wikimedia.org/T329679) (owner: 10Jelto)
[15:59:47] <wikibugs>	 (03PS1) 10Zabe: noc: Publicly expose EventStreamConfig settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896121 (https://phabricator.wikimedia.org/T308932)
[15:59:52] <wikibugs>	 (03CR) 10Nicolas Fraison: [C: 03+1] Update the spark-operator chart with consistent image details [deployment-charts] - 10https://gerrit.wikimedia.org/r/895842 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis)
[15:59:57] <wikibugs>	 (03CR) 10Zabe: [C: 03+2] noc: Publicly expose EventStreamConfig settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896121 (https://phabricator.wikimedia.org/T308932) (owner: 10Zabe)
[16:00:09] <marostegui>	 !log Failover m5 from db1183 to db1176 - T330847
[16:00:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:00:16] <icinga-wm>	 RECOVERY - Check systemd state on acmechief2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:00:17] <stashbot>	 T330847: Switchover m5 master (db1183 -> db1176) - https://phabricator.wikimedia.org/T330847
[16:00:33] <marostegui>	 bd808: all done
[16:00:47] <wikibugs>	 (03Merged) 10jenkins-bot: noc: Publicly expose EventStreamConfig settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896121 (https://phabricator.wikimedia.org/T308932) (owner: 10Zabe)
[16:01:00] <marostegui>	 Around 15 seconds RO
[16:01:15] <bd808>	 Brutal ;)
[16:01:32] <bd808>	 striker is working as expected.
[16:01:45] <wikibugs>	 (03CR) 10Andrea Denisse: [V: 03+1 C: 03+2] centrallog: Add centrallog1002 as the kafkatee active host [puppet] - 10https://gerrit.wikimedia.org/r/895902 (https://phabricator.wikimedia.org/T328803) (owner: 10Andrea Denisse)
[16:01:46] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on acmechief1001.eqiad.wmnet with reason: host reimage
[16:01:58] <bd808>	 toolhub looks good too
[16:02:20] <logmsgbot>	 !log zabe@deploy2002 Started scap: T308932
[16:02:25] <stashbot>	 T308932: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022)    - https://phabricator.wikimedia.org/T308932
[16:02:29] <marostegui>	 bd808: including writes?
[16:02:45] <wikibugs>	 10SRE, 10DBA, 10Toolhub, 10Wikimedia-Mailing-lists, 10cloud-services-team: Switchover m5 master (db1183 -> db1176) - https://phabricator.wikimedia.org/T330847 (10Marostegui)
[16:02:53] <wikibugs>	 (03Abandoned) 10David Caro: maintain-dbusers: skip tool accounts that are not ready [puppet] - 10https://gerrit.wikimedia.org/r/895838 (https://phabricator.wikimedia.org/T303663) (owner: 10David Caro)
[16:02:59] <marostegui>	 !log Restart mailman service T331626
[16:03:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:03:04] <stashbot>	 T331626: reviewer-bot is not working - https://phabricator.wikimedia.org/T331626
[16:03:05] <wikibugs>	 (03CR) 10Andrea Denisse: [V: 03+1 C: 03+2] rsyslog: Remove centrallog1001 as TLS rsyslog destination [puppet] - 10https://gerrit.wikimedia.org/r/890884 (https://phabricator.wikimedia.org/T328803) (owner: 10Andrea Denisse)
[16:03:07] <bd808>	 marostegui: yes, on both
[16:03:16] <marostegui>	 bd808: \o/
[16:03:21] <icinga-wm>	 PROBLEM - Check systemd state on acmechief2001 is CRITICAL: CRITICAL - degraded: The following units failed: acme-chief-certs-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:03:22] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+2] centrallog: Remove centrallog1001 from the kafka-jumbo allow list [puppet] - 10https://gerrit.wikimedia.org/r/895898 (https://phabricator.wikimedia.org/T328803) (owner: 10Andrea Denisse)
[16:03:25] <marostegui>	 bd808: we are done then!
[16:03:28] <marostegui>	 thanks for being around 
[16:03:47] <bd808>	 thank you for doing the needful
[16:04:42] <wikibugs>	 10SRE, 10Gerrit, 10Infrastructure-Foundations, 10Mail, 10Wikimedia-Mailing-lists: reviewer-bot is not working - https://phabricator.wikimedia.org/T331626 (10hashar) Icinga says OK: mailman3 queues are below the limits, but there is an alert about the runners:  PROCS CRITICAL: 13 processes with UID = 38 (...
[16:04:58] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on acmechief1001.eqiad.wmnet with reason: host reimage
[16:05:18] <wikibugs>	 (03PS1) 10MVernon: swift: bring ms-be1066 sdr1 back into service [puppet] - 10https://gerrit.wikimedia.org/r/896124 (https://phabricator.wikimedia.org/T329305)
[16:06:12] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: Testing Out Hard Drive on Swift Server - https://phabricator.wikimedia.org/T329305 (10MatthewVernon)
[16:06:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T329260)', diff saved to https://phabricator.wikimedia.org/P45715 and previous config saved to /var/cache/conftool/dbconfig/20230309-160630-marostegui.json
[16:06:36] <stashbot>	 T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260
[16:06:51] <wikibugs>	 (03PS2) 10MVernon: admin: update sbassett ssh key [puppet] - 10https://gerrit.wikimedia.org/r/896024 (https://phabricator.wikimedia.org/T331554)
[16:09:39] <logmsgbot>	 !log zabe@deploy2002 Finished scap: T308932 (duration: 07m 19s)
[16:09:44] <stashbot>	 T308932: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022)    - https://phabricator.wikimedia.org/T308932
[16:09:47] <wikibugs>	 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10Marostegui)
[16:10:05] <wikibugs>	 10SRE, 10DBA, 10Toolhub, 10Wikimedia-Mailing-lists, 10cloud-services-team: Switchover m5 master (db1183 -> db1176) - https://phabricator.wikimedia.org/T330847 (10Marostegui)
[16:10:19] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Data-Persistence, 10Thumbor Migration, 10Platform Team Workboards (Platform Engineering Reliability): Pooling thumbor-k8s causes spikes in swift 500 errors - https://phabricator.wikimedia.org/T328033 (10hnowlan) After pooling again and looking into the Swift logs, we realise...
[16:10:27] <wikibugs>	 10SRE, 10DBA, 10Toolhub, 10Wikimedia-Mailing-lists, 10cloud-services-team: Switchover m5 master (db1183 -> db1176) - https://phabricator.wikimedia.org/T330847 (10Marostegui) 05Open→03Resolved This was done, the RO time was around 15 seconds. Thanks @bd808 for the support!
[16:10:41] <wikibugs>	 (03PS7) 10David Caro: maintain-dbusers: add nicer logging with dry run prefix [puppet] - 10https://gerrit.wikimedia.org/r/895756 (https://phabricator.wikimedia.org/T303663)
[16:11:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2163 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P45716 and previous config saved to /var/cache/conftool/dbconfig/20230309-161103-root.json
[16:15:30] <wikibugs>	 10SRE, 10Gerrit, 10Infrastructure-Foundations, 10Mail, 10Wikimedia-Mailing-lists: reviewer-bot is not working - https://phabricator.wikimedia.org/T331626 (10Marostegui) It looks like the restart I made fixed it or at least it is slowly going down: https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3?orgId=...
[16:16:13] <wikibugs>	 10SRE, 10Gerrit, 10Infrastructure-Foundations, 10Mail, 10Wikimedia-Mailing-lists: reviewer-bot is not working - https://phabricator.wikimedia.org/T331626 (10hashar) 05Open→03Resolved a:03hashar Mail should be emitted again, it will take a bit of time to clear the queue though. That can be monitored...
[16:16:21] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists: Not receiving posts or moderation messages - https://phabricator.wikimedia.org/T331633 (10Aklapper) p:05Triage→03Unbreak! Potential regression from {T329073}, similar to {T331626}
[16:18:18] <icinga-wm>	 RECOVERY - Check systemd state on acmechief2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:18:32] <wikibugs>	 (03PS1) 10JMeybohm: Revert: cert-manager: Disable seccomProfile for k8s 1.16 compatibility [deployment-charts] - 10https://gerrit.wikimedia.org/r/896128 (https://phabricator.wikimedia.org/T325292)
[16:18:34] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host acmechief1001.eqiad.wmnet with OS bullseye
[16:18:43] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by brett@cumin2002 for host acmechief1001.eqiad.wmnet with OS bullseye completed: - acmechief1001 (**PASS**)   - Downtimed on Icinga/Alertmanager...
[16:21:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P45717 and previous config saved to /var/cache/conftool/dbconfig/20230309-162137-marostegui.json
[16:23:32] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall)
[16:24:09] <wikibugs>	 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10Marostegui)
[16:25:00] <wikibugs>	 (03PS1) 10JMeybohm: Migrate away from deprecated typology annotations [deployment-charts] - 10https://gerrit.wikimedia.org/r/896130 (https://phabricator.wikimedia.org/T325066)
[16:25:08] <wikibugs>	 (03CR) 10Dzahn: "Could you please coordinate with serviceops on this one" [dns] - 10https://gerrit.wikimedia.org/r/896118 (https://phabricator.wikimedia.org/T331634) (owner: 10Zabe)
[16:26:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2163 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P45718 and previous config saved to /var/cache/conftool/dbconfig/20230309-162608-root.json
[16:26:57] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2165.codfw.wmnet with reason: Maintenance
[16:27:00] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2165.codfw.wmnet with reason: Maintenance
[16:27:30] <wikibugs>	 (03CR) 10Dzahn: "I think https://phabricator.wikimedia.org/project/members/3158/ might be a better match than git blame in this case." [puppet] - 10https://gerrit.wikimedia.org/r/895884 (https://phabricator.wikimedia.org/T284555) (owner: 10BCornwall)
[16:28:32] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Update the spark-operator chart with consistent image details [deployment-charts] - 10https://gerrit.wikimedia.org/r/895842 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis)
[16:28:56] <icinga-wm>	 RECOVERY - mailman3_runners on lists1001 is OK: PROCS OK: 14 processes with UID = 38 (list), regex args /usr/lib/mailman3/bin/runner https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:29:39] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Update the spark-operator chart with consistent image details (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/895842 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis)
[16:31:13] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists: Not receiving posts or moderation messages - https://phabricator.wikimedia.org/T331633 (10Marostegui) Probably because of T331626 which is already fixed and recovering. It will take a bit until the queue gets emptied but the trend looks good: https://grafana.wikimedia.org/d/Gv...
[16:32:07] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists: Not receiving posts or moderation messages - https://phabricator.wikimedia.org/T331633 (10Marostegui) For the record: looks like the restart fixed it (T331626#8680413)
[16:33:39] <wikibugs>	 (03Merged) 10jenkins-bot: Update the spark-operator chart with consistent image details [deployment-charts] - 10https://gerrit.wikimedia.org/r/895842 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis)
[16:36:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P45719 and previous config saved to /var/cache/conftool/dbconfig/20230309-163643-marostegui.json
[16:36:48] <herzog>	 marostegui: re mailman, will old messages and moderation notices be relayed to the recipients or are they lost forever?
[16:37:46] <RhinosF1>	 herzog: they are being very slowly delivered
[16:37:54] <wikibugs>	 (03PS1) 10JMeybohm: custom_deploy.d: Make k8s 1.23 istio configs the default [deployment-charts] - 10https://gerrit.wikimedia.org/r/896131 (https://phabricator.wikimedia.org/T328291)
[16:40:04] <marostegui>	 herzog: they should arrive when the queue gets processed 
[16:40:30] <herzog>	 thanks marostegui & RhinosF1
[16:42:28] <wikibugs>	 (03CR) 10JHathaway: [C: 03+1] "looks good to me!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/896130 (https://phabricator.wikimedia.org/T325066) (owner: 10JMeybohm)
[16:42:55] <wikibugs>	 10SRE, 10Gerrit, 10Infrastructure-Foundations, 10Mail, 10Wikimedia-Mailing-lists: reviewer-bot is not working - https://phabricator.wikimedia.org/T331626 (10hashar) >>! In T331626#8680354, @hashar wrote: > PROCS CRITICAL: 13 processes with UID = 38 (list), regex args '/usr/lib/mailman3/bin/runner' > Last...
[16:43:16] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (10Cmjohnson) I received an idrac error on 3 of these hosts, I confirmed with Jeff that he is not able to access the host.  I am going to try and update t...
[16:47:59] <wikibugs>	 (03PS1) 10Subramanya Sastry: Revert "TransformHandler: Load stashed page bundle based on ETag." [core] (wmf/1.40.0-wmf.26) - 10https://gerrit.wikimedia.org/r/896030 (https://phabricator.wikimedia.org/T331629)
[16:49:02] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists: Not receiving posts or moderation messages - https://phabricator.wikimedia.org/T331633 (10MatthewVernon) p:05Unbreak!→03Medium
[16:49:16] <wikibugs>	 (03PS1) 10Cwhite: logstash: mediawiki_ecs copy http_method into place [puppet] - 10https://gerrit.wikimedia.org/r/895739
[16:49:27] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists: Not receiving posts or moderation messages - https://phabricator.wikimedia.org/T331633 (10MatthewVernon) Setting to medium priority, because this is probably now just a case of waiting for the queue to drain.
[16:50:21] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[16:51:46] <topranks>	 !log Add EBGP peering from cr1-codfw to cloudsw1-b1-codfw (cloud vrf) T327919
[16:51:49] <wikibugs>	 10SRE, 10ops-eqiad: anworker1132 BBU issue/replacement - https://phabricator.wikimedia.org/T331543 (10Cmjohnson) @RhinosF1 Do I still need to troubleshoot the BBU or is no longer needed?
[16:51:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T329260)', diff saved to https://phabricator.wikimedia.org/P45720 and previous config saved to /var/cache/conftool/dbconfig/20230309-165149-marostegui.json
[16:51:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:51:51] <stashbot>	 T327919: Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919
[16:51:51] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2179.codfw.wmnet with reason: Maintenance
[16:51:55] <stashbot>	 T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260
[16:51:58] <wikibugs>	 (03PS1) 10JMeybohm: Move default kubernetes version to 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/896134 (https://phabricator.wikimedia.org/T328291)
[16:52:05] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2179.codfw.wmnet with reason: Maintenance
[16:52:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2179 (T329260)', diff saved to https://phabricator.wikimedia.org/P45721 and previous config saved to /var/cache/conftool/dbconfig/20230309-165210-marostegui.json
[16:52:34] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd10(3[5-9]|40) - https://phabricator.wikimedia.org/T324998 (10Cmjohnson) Failed install but I didn't change the raid controller.
[16:52:53] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd10(3[5-9]|40) - https://phabricator.wikimedia.org/T324998 (10Cmjohnson) a:05Jclark-ctr→03Cmjohnson
[16:55:01] <wikibugs>	 (03PS1) 10BryanDavis: developer-portal: Bump container version to 2023-03-06-121941-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/896135
[16:55:02] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[16:56:01] <zabe>	 jouncebot: nowandnext
[16:56:01] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 3 minute(s)
[16:56:01] <jouncebot>	 In 0 hour(s) and 3 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230309T1700)
[16:56:13] <wikibugs>	 (03CR) 10Zabe: [C: 03+2] Revert "TransformHandler: Load stashed page bundle based on ETag." [core] (wmf/1.40.0-wmf.26) - 10https://gerrit.wikimedia.org/r/896030 (https://phabricator.wikimedia.org/T331629) (owner: 10Subramanya Sastry)
[16:56:37] <wikibugs>	 (03PS1) 10Btullis: Remove the install-crds parameter frlom spark-operator [deployment-charts] - 10https://gerrit.wikimedia.org/r/896137 (https://phabricator.wikimedia.org/T315486)
[16:56:40] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[16:58:42] <wikibugs>	 (03CR) 10Nicolas Fraison: [C: 03+1] Remove the install-crds parameter frlom spark-operator [deployment-charts] - 10https://gerrit.wikimedia.org/r/896137 (https://phabricator.wikimedia.org/T315486) (owner: 10Btullis)
[16:58:48] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] striker: Bump container version to 2023-03-09-005633-production [puppet] - 10https://gerrit.wikimedia.org/r/895892 (https://phabricator.wikimedia.org/T330421) (owner: 10BryanDavis)
[16:58:50] <wikibugs>	 (03PS43) 10David Caro: Modify maintain-dbusers.py to call the rest-api service [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe)
[17:00:04] <jouncebot>	 jbond and rzl: Your horoscope predicts another unfortunate Puppet request window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230309T1700).
[17:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[17:00:18] <wikibugs>	 (03PS2) 10JMeybohm: Move default kubernetes version to 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/896134 (https://phabricator.wikimedia.org/T328291)
[17:00:25] <TheresNoTime>	 seeing intermittent phabricator issues (`Unable to establish a connection to any database host (while trying "phabricator_spaces"). All masters and replicas are completely unreachable. AphrontConnectionLostQueryException: #2006: MySQL server has gone away`)
[17:00:58] <marostegui>	 TheresNoTime: let me check the DBs 
[17:01:35] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Remove the install-crds parameter frlom spark-operator [deployment-charts] - 10https://gerrit.wikimedia.org/r/896137 (https://phabricator.wikimedia.org/T315486) (owner: 10Btullis)
[17:02:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T329260)', diff saved to https://phabricator.wikimedia.org/P45722 and previous config saved to /var/cache/conftool/dbconfig/20230309-170205-marostegui.json
[17:02:11] <stashbot>	 T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260
[17:02:20] <TheresNoTime>	 (fwiw, twice in ~10 minutes, persisted a few minutes each time)
[17:02:23] <marostegui>	 TheresNoTime: Everything seems to be working fine
[17:02:33] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 12): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40066/console" [puppet] - 10https://gerrit.wikimedia.org/r/896134 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm)
[17:02:46] <TheresNoTime>	 thank you for looking :)
[17:03:12] <wikibugs>	 10SRE, 10ops-eqiad: anworker1132 BBU issue/replacement - https://phabricator.wikimedia.org/T331543 (10Cmjohnson) 05Stalled→03Resolved issue turned out to be no issue, resolving the task
[17:03:16] <marostegui>	 TheresNoTime: Yeah, the graphs also do not show any weird patterns 
[17:03:18] <dancy>	 marostegui: There is at least one other person who experienced that too.
[17:03:35] <wikibugs>	 (03PS1) 10Cwhite: logstash: move mediawiki ecs logs into mediawiki partition [puppet] - 10https://gerrit.wikimedia.org/r/895741 (https://phabricator.wikimedia.org/T234565)
[17:03:40] <dancy>	 Solar flare
[17:03:59] <marostegui>	 Maybe the frontend is having issues? 
[17:03:59] <wikibugs>	 (03PS2) 10Cwhite: logstash: mediawiki_ecs copy http_method into place [puppet] - 10https://gerrit.wikimedia.org/r/895739 (https://phabricator.wikimedia.org/T234565)
[17:04:00] <btullis>	 marostegui: Confirmed. We've (DE team) also seen transient MySQL errors from phab. Not many, but some.
[17:04:01] <marostegui>	 mutante ^
[17:04:50] <marostegui>	 I can't see anything wrong with the master and the graphs are looking healthy as well
[17:05:00] <TheresNoTime>	 ah good to see some of phabricator's "funny" error messages are still around - now getting `Woe! This request had its journey cut short by unexpected circumstances (Can Not Connect to MySQL).`
[17:05:08] <dancy>	 hehe
[17:05:25] <dancy>	 I do like that better than the usual dry messages
[17:05:29] <brennen>	 some things do seem a bit slow to load.
[17:05:37] <TheresNoTime>	 (and gone, so it's very intermittent, whatever it is..)
[17:05:41] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] logstash: move mediawiki ecs logs into mediawiki partition [puppet] - 10https://gerrit.wikimedia.org/r/895741 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite)
[17:06:05] <marostegui>	 It is indeed very slow
[17:06:16] <marostegui>	 mutante: you around to check the frontend?
[17:06:35] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] logstash: mediawiki_ecs copy http_method into place [puppet] - 10https://gerrit.wikimedia.org/r/895739 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite)
[17:07:00] <brennen>	 logs on phab are fairly spammy in general at normal times, but i'm seeing some "AphrontConnectionLostQueryException: #2006: MySQL server has gone away".
[17:07:06] <wikibugs>	 (03Merged) 10jenkins-bot: Remove the install-crds parameter frlom spark-operator [deployment-charts] - 10https://gerrit.wikimedia.org/r/896137 (https://phabricator.wikimedia.org/T315486) (owner: 10Btullis)
[17:07:33] <wikibugs>	 (03PS2) 10Cwhite: logstash: move mediawiki ecs logs into mediawiki partition [puppet] - 10https://gerrit.wikimedia.org/r/895741 (https://phabricator.wikimedia.org/T234565)
[17:07:45] <marostegui>	 brennen: That could be because the connection has been hanging for a while, and when it tries to re-use that one, it is gone
[17:08:12] <marostegui>	 I am checking the proxy too
[17:09:42] <marostegui>	 I have seen some errors on haproxy, I have reloaded it to see if they clear
[17:10:54] <marostegui>	 Yeah, I think it was haproxy
[17:11:29] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: RAID BBU for an-worker1078 - https://phabricator.wikimedia.org/T331544 (10Cmjohnson) This server is out of warranty, I am not sure if we have any spares or a battery we can swap from a decom host. I'll update the task with more info after talking with @Jclark...
[17:12:03] <wikibugs>	 10SRE-Access-Requests: Grant Hal deployment rights - https://phabricator.wikimedia.org/T331647 (10Milimetric)
[17:12:13] <marostegui>	 brennen btullis TheresNoTime dancy let me know if you keep seeing them, the haproxy error is now gone
[17:12:19] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Revert "TransformHandler: Load stashed page bundle based on ETag." [core] (wmf/1.40.0-wmf.26) - 10https://gerrit.wikimedia.org/r/896030 (https://phabricator.wikimedia.org/T331629) (owner: 10Subramanya Sastry)
[17:12:28] <brennen>	 marostegui: ack, thanks.
[17:12:34] <TheresNoTime>	 okay :) thanks again
[17:13:15] <topranks>	 !log Add EBGP peering from cr1-codfw to cloudsw1-b1-codfw (prod links) T327919
[17:13:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:13:21] <stashbot>	 T327919: Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919
[17:15:13] <wikibugs>	 (03CR) 10Zabe: [C: 03+2] "recheck" [core] (wmf/1.40.0-wmf.26) - 10https://gerrit.wikimedia.org/r/896030 (https://phabricator.wikimedia.org/T331629) (owner: 10Subramanya Sastry)
[17:16:21] <wikibugs>	 10SRE, 10Data-Engineering-Operations, 10Data-Engineering-Planning, 10Mail: Change the analytics-alerts email alias to a mailman distribution list - https://phabricator.wikimedia.org/T315486 (10BTullis) Sorry, these two patches are unrelated to this patch. Added by mistake.
[17:17:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P45723 and previous config saved to /var/cache/conftool/dbconfig/20230309-171711-marostegui.json
[17:22:58] <icinga-wm>	 PROBLEM - Ensure mysql credential creation for tools users is running on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[17:26:36] <wikibugs>	 (03CR) 10AOkoth: [C: 03+1] Sync more clamd.conf settings from 0.103.8 [puppet] - 10https://gerrit.wikimedia.org/r/895815 (https://phabricator.wikimedia.org/T330129) (owner: 10Muehlenhoff)
[17:29:12] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:30:10] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "TransformHandler: Load stashed page bundle based on ETag." [core] (wmf/1.40.0-wmf.26) - 10https://gerrit.wikimedia.org/r/896030 (https://phabricator.wikimedia.org/T331629) (owner: 10Subramanya Sastry)
[17:31:10] <jinxer-wm>	 (ThanosQueryRangeLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryRangeLatencyHigh
[17:32:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P45724 and previous config saved to /var/cache/conftool/dbconfig/20230309-173217-marostegui.json
[17:33:07] <wikibugs>	 (03PS1) 10JHathaway: aux-k8s: fix secret location, attempt three [labs/private] - 10https://gerrit.wikimedia.org/r/896141
[17:33:37] <wikibugs>	 (03CR) 10JHathaway: [V: 03+2 C: 03+2] aux-k8s: fix secret location, attempt three [labs/private] - 10https://gerrit.wikimedia.org/r/896141 (owner: 10JHathaway)
[17:36:10] <jinxer-wm>	 (ThanosQueryRangeLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryRangeLatencyHigh
[17:36:24] <sukhe>	 !log cr1-eqiad: set routing-options static route 208.80.154.238/32 next-hop 208.80.154.10
[17:36:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:37:34] <sukhe>	 !log cr1-eqiad: set routing-options static route 208.80.154.238/32 next-hop 208.80.154.10: T330670
[17:37:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:37:38] <stashbot>	 T330670: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670
[17:37:39] <sukhe>	 !log cr2-eqiad: set routing-options static route 208.80.154.238/32 next-hop 208.80.154.10: T330670
[17:37:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:38:05] <logmsgbot>	 !log zabe@deploy2002 Started scap: Backport for [[gerrit:896030|Revert "TransformHandler: Load stashed page bundle based on ETag." (T331629)]]
[17:38:09] <stashbot>	 T331629: HTTP 412 Errors when editing Officewiki - https://phabricator.wikimedia.org/T331629
[17:38:52] <wikibugs>	 (03PS11) 10JHathaway: Add jaeger to aux-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/888761 (https://phabricator.wikimedia.org/T320554)
[17:38:58] <wikibugs>	 (03CR) 10Phedenskog: [C: 03+1] Replace Cleopatra page with United_States to facilitate synthetic testing of T326829 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893542 (https://phabricator.wikimedia.org/T326829) (owner: 10Nray)
[17:39:46] <logmsgbot>	 !log zabe@deploy2002 zabe and ssastry: Backport for [[gerrit:896030|Revert "TransformHandler: Load stashed page bundle based on ETag." (T331629)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet
[17:39:58] <zabe>	 subbu: is there a good way to test this patch?
[17:40:06] <wikibugs>	 (03CR) 10JHathaway: [C: 03+1] custom_deploy.d: Make k8s 1.23 istio configs the default [deployment-charts] - 10https://gerrit.wikimedia.org/r/896131 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm)
[17:40:31] <subbu>	 I can try to do a bunch of edits and verify if they pass or give me a 412.
[17:40:44] <subbu>	 not a robust test but better than nothing.
[17:40:51] <zabe>	 ok
[17:41:38] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:41:48] <subbu>	 is it on debug then?
[17:41:51] <subbu>	 let me test.
[17:41:58] <zabe>	 yes
[17:42:36] <sukhe>	 !log [ns1] set routing-options static route 208.80.153.231/32 next-hop 208.80.154.10: T330670
[17:42:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:42:41] <stashbot>	 T330670: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670
[17:44:06] <subbu>	 i haven't got 412s on mwdebug .. so, go ahead with it.
[17:44:32] <zabe>	 cool, syncing
[17:46:01] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 (10ssingh) `cr2-eqiad` (replicated to `cr1-eqiad` as well):  ` /* ns0 */ route 208.80.154.238/32 {     next-hop 208.80.154.10;     readvertise;     no-reso...
[17:46:10] <jinxer-wm>	 (ThanosQueryRangeLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryRangeLatencyHigh
[17:46:22] <wikibugs>	 (03CR) 10Herron: [C: 03+1] logstash: move mediawiki ecs logs into mediawiki partition [puppet] - 10https://gerrit.wikimedia.org/r/895741 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite)
[17:47:07] <wikibugs>	 (03CR) 10JHathaway: [C: 03+1] "looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/896110 (https://phabricator.wikimedia.org/T292942) (owner: 10Muehlenhoff)
[17:47:17] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[17:47:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T329260)', diff saved to https://phabricator.wikimedia.org/P45725 and previous config saved to /var/cache/conftool/dbconfig/20230309-174723-marostegui.json
[17:47:30] <stashbot>	 T329260: Drop cuc_comment from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T329260
[17:47:48] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[17:50:02] <logmsgbot>	 !log zabe@deploy2002 Finished scap: Backport for [[gerrit:896030|Revert "TransformHandler: Load stashed page bundle based on ETag." (T331629)]] (duration: 11m 57s)
[17:50:08] <stashbot>	 T331629: HTTP 412 Errors when editing Officewiki - https://phabricator.wikimedia.org/T331629
[17:50:10] <zabe>	 subbu: should be live
[17:50:16] <subbu>	 ty
[17:51:10] <jinxer-wm>	 (ThanosQueryRangeLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryRangeLatencyHigh
[17:52:21] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/894655 (owner: 10Giuseppe Lavagetto)
[17:53:54] <sukhe>	 !log cr*-codfw [ns1]: set routing-options static route 208.80.153.231/32 next-hop 208.80.153.77: T330670
[17:53:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:53:59] <stashbot>	 T330670: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670
[17:54:44] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[17:56:10] <jinxer-wm>	 (ThanosRuleHighRuleEvaluationFailures) firing: Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures
[17:56:50] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 (10ssingh) `cr2-codfw` (replicated to `cr1-codfw` as well):  ` /* ns1 */ route 208.80.153.231/32 {     next-hop 208.80.153.77;     readvertise;     no-reso...
[17:59:44] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[18:00:04] <jouncebot>	 bd808: That opportune time is upon us again. Time for a Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230309T1800).
[18:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230309T1800)
[18:00:05] <sukhe>	 !log cr*-codfw [ns0]: set routing-options static route 208.80.154.238/32 next-hop 208.80.153.77: T330670
[18:00:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:00:11] <stashbot>	 T330670: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670
[18:00:32] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[18:01:10] <jinxer-wm>	 (ThanosRuleHighRuleEvaluationFailures) resolved: Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures
[18:01:16] <topranks>	 ^^ the dns may be me sorry
[18:01:27] <sukhe>	 phew ok :) 
[18:01:30] <sukhe>	 thanks
[18:01:49] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[18:02:07] <wikibugs>	 (03CR) 10BryanDavis: [C: 03+2] developer-portal: Bump container version to 2023-03-06-121941-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/896135 (owner: 10BryanDavis)
[18:03:33] <kizule>	 Hi, is something happening with local-image-codfw?
[18:03:41] <kizule>	 *local-swift-codfw
[18:03:55] <kizule>	 I’m unable to delete one image from Serbian Wikipedia.
[18:04:22] <kizule>	 Oh, I was able to delete it now after a few tries.
[18:07:16] <wikibugs>	 (03PS44) 10David Caro: Modify maintain-dbusers.py to call the rest-api service [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe)
[18:07:21] <wikibugs>	 (03Merged) 10jenkins-bot: developer-portal: Bump container version to 2023-03-06-121941-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/896135 (owner: 10BryanDavis)
[18:08:04] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[18:08:25] <wikibugs>	 (03PS1) 10Jbond: apereo_cas: update idp logout script [puppet] - 10https://gerrit.wikimedia.org/r/896146
[18:08:36] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[18:09:03] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] apereo_cas: update idp logout script [puppet] - 10https://gerrit.wikimedia.org/r/896146 (owner: 10Jbond)
[18:09:10] <logmsgbot>	 !log bd808@deploy2002 helmfile [staging] START helmfile.d/services/developer-portal: apply
[18:09:24] <wikibugs>	 (03PS8) 10David Caro: maintain-dbusers: add nicer logging with dry run prefix [puppet] - 10https://gerrit.wikimedia.org/r/895756 (https://phabricator.wikimedia.org/T303663)
[18:09:33] <logmsgbot>	 !log bd808@deploy2002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply
[18:09:41] <logmsgbot>	 !log bd808@deploy2002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply
[18:10:09] <wikibugs>	 (03PS8) 10Jbond: P:rsyslog: manage /etc/logrotate.d/rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/894646
[18:10:15] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[18:10:16] <logmsgbot>	 !log bd808@deploy2002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply
[18:10:45] <logmsgbot>	 !log bd808@deploy2002 helmfile [codfw] START helmfile.d/services/developer-portal: apply
[18:10:46] <wikibugs>	 (03CR) 10SBassett: "I confirm that is my new public key for wikimedia production.  Let me know if you'd like any additional verification!" [puppet] - 10https://gerrit.wikimedia.org/r/896024 (https://phabricator.wikimedia.org/T331554) (owner: 10MVernon)
[18:11:20] <logmsgbot>	 !log bd808@deploy2002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply
[18:11:51] <wikibugs>	 (03PS9) 10Jbond: P:rsyslog: manage /etc/logrotate.d/rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/894646
[18:12:12] <wikibugs>	 (03PS45) 10David Caro: Modify maintain-dbusers.py to call the rest-api service [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe)
[18:13:02] <wikibugs>	 (03CR) 10Jbond: "update" [puppet] - 10https://gerrit.wikimedia.org/r/894646 (owner: 10Jbond)
[18:14:13] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review, 10SecTeam-Processed, 10Security: New production ssh key for sbassett - https://phabricator.wikimedia.org/T331554 (10sbassett) >>! In T331554#8678974, @MatthewVernon wrote: > @sbassett I've opened a CR to update your ssh key - if you can confirm it's corre...
[18:14:38] <wikibugs>	 (03PS12) 10JHathaway: Add jaeger to aux-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/888761 (https://phabricator.wikimedia.org/T320554)
[18:15:19] <wikibugs>	 (03PS46) 10David Caro: Modify maintain-dbusers.py to call the rest-api service [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe)
[18:15:38] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts authdns[1001,2001].wikimedia.org
[18:16:40] <wikibugs>	 (03PS9) 10David Caro: maintain-dbusers: add nicer logging with dry run prefix [puppet] - 10https://gerrit.wikimedia.org/r/895756 (https://phabricator.wikimedia.org/T303663)
[18:17:24] <wikibugs>	 (03PS2) 10SBassett: eswikiversity: Enable SFS in enforce mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896085 (https://phabricator.wikimedia.org/T331182) (owner: 10MarcoAurelio)
[18:18:09] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+1] "Forgot the +1 code-review." [puppet] - 10https://gerrit.wikimedia.org/r/895811 (owner: 10Jbond)
[18:18:39] <wikibugs>	 (03CR) 10SBassett: [C: 03+1] "Happy to do a +2 and then config deploy as long as Reedy or anybody else do not have any objections.  I don't personally think this needs " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896085 (https://phabricator.wikimedia.org/T331182) (owner: 10MarcoAurelio)
[18:18:52] <wikibugs>	 10SRE-swift-storage, 10Commons, 10Wikimedia-production-error: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10Yann) `00152: FAILED: internal_api_error_UploadChunkFileException: [dc0355d4-60e7-4764-8c67-8ac4166bed53...
[18:19:41] <wikibugs>	 (03PS1) 10Ssingh: hiera: remove decommissionned authdns[12]001 [puppet] - 10https://gerrit.wikimedia.org/r/896151 (https://phabricator.wikimedia.org/T330670)
[18:20:44] <icinga-wm>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:21:06] <sukhe>	 ^ expected
[18:21:09] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[18:21:34] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] hiera: remove decommissionned authdns[12]001 [puppet] - 10https://gerrit.wikimedia.org/r/896151 (https://phabricator.wikimedia.org/T330670) (owner: 10Ssingh)
[18:21:40] <icinga-wm>	 PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:21:50] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:22:03] <sukhe>	 ^ expected, will resolve soon after homer
[18:22:12] <icinga-wm>	 PROBLEM - BFD status on cr2-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:22:17] <wikibugs>	 (03CR) 10BBlack: [C: 03+1] sites.yaml: remove authdns[12]001 [homer/public] - 10https://gerrit.wikimedia.org/r/894102 (https://phabricator.wikimedia.org/T330670) (owner: 10Ssingh)
[18:22:28] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:22:29] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists: Not receiving posts or moderation messages - https://phabricator.wikimedia.org/T331633 (10Legoktm)
[18:22:35] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.dns.netbox
[18:22:36] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:22:46] <sukhe>	 !log running puppet-agent on A:dns-auth to remove deprecated authdns[12]001
[18:22:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:23:13] <wikibugs>	 10SRE, 10Gerrit, 10Infrastructure-Foundations, 10Mail, 10Wikimedia-Mailing-lists: reviewer-bot is not working - https://phabricator.wikimedia.org/T331626 (10Legoktm)
[18:23:49] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Mail, 10Wikimedia-Mailing-lists: Mailman hasn't delivered emails since 2023-03-07 14 UTC (was: reviewer-bot is not working) - https://phabricator.wikimedia.org/T331626 (10Legoktm) 05Resolved→03Open p:05Triage→03Medium a:05hashar→03Marostegui
[18:24:26] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: authdns[1001,2001].wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002"
[18:25:56] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: authdns[1001,2001].wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002"
[18:25:56] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:25:57] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts authdns[1001,2001].wikimedia.org
[18:26:05] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `authdns[1001,2001].wikimedia.org` - authdns1001.wikimedia.o...
[18:26:46] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[18:26:47] <wikibugs>	 (03CR) 10JHathaway: [C: 03+2] Add jaeger to aux-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/888761 (https://phabricator.wikimedia.org/T320554) (owner: 10JHathaway)
[18:28:34] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] sites.yaml: remove authdns[12]001 [homer/public] - 10https://gerrit.wikimedia.org/r/894102 (https://phabricator.wikimedia.org/T330670) (owner: 10Ssingh)
[18:28:52] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Mail, 10Wikimedia-Mailing-lists, 10Wikimedia-Incident: Mailman hasn't delivered emails since 2023-03-07 14 UTC (was: reviewer-bot is not working) - https://phabricator.wikimedia.org/T331626 (10Legoktm) Re-opening just for tracking while we wait for the queue to go d...
[18:31:06] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Mail, 10Wikimedia-Mailing-lists, 10Wikimedia-Incident: Mailman hasn't delivered emails since 2023-03-07 14 UTC (was: reviewer-bot is not working) - https://phabricator.wikimedia.org/T331626 (10Legoktm) There are 2,936 emails in the out queue, it takes ~5.1 seconds t...
[18:31:31] <sukhe>	 !log homer "cr*-eqiad*" commit "Remove authdns1001 from homer, T330670"
[18:31:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:31:36] <stashbot>	 T330670: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670
[18:32:57] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10Cmjohnson) @MatthewVernon working on these now, I will let you know if I run into any blocks
[18:33:34] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:34:02] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] P:cumin: update alias for dns-auth to reflect changes to dns roles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/894688 (https://phabricator.wikimedia.org/T330670) (owner: 10Ssingh)
[18:34:30] <wikibugs>	 (03PS3) 10Ssingh: P:cumin: update alias for dns-auth to reflect changes to dns roles [puppet] - 10https://gerrit.wikimedia.org/r/894688 (https://phabricator.wikimedia.org/T330670)
[18:34:45] <sukhe>	 !log homer "cr*-codfw*" commit "Remove authdns1001 from homer, T330670"
[18:34:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:34:50] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: RAID BBU for an-worker1078 - https://phabricator.wikimedia.org/T331544 (10Jclark-ctr) @Cmjohnson we have a few batteries @BTullis if you can shut down server we can take care of it
[18:34:58] <sukhe>	 !log [correction] homer "cr*-codfw*" commit "Remove authdns2001 from homer, T330670"
[18:35:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:37:30] <icinga-wm>	 RECOVERY - Ensure mysql credential creation for tools users is running on labstore1004 is OK: OK - maintain-dbusers is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[18:38:36] <logmsgbot>	 !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
[18:38:43] <logmsgbot>	 !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[18:42:33] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Mail, 10Wikimedia-Mailing-lists, 10Wikimedia-Incident: Mailman hasn't delivered emails since 2023-03-07 14 UTC (was: reviewer-bot is not working) - https://phabricator.wikimedia.org/T331626 (10Legoktm) Sent [[ https://lists.wikimedia.org/hyperkitty/list/listadmins@l...
[18:43:48] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[18:44:05] <sukhe>	 !log disable puppet on A:dns-rec to merge CR 895894
[18:44:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:45:31] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1 C: 03+2] dns::auth: deprecate role and update site.pp [puppet] - 10https://gerrit.wikimedia.org/r/895894 (https://phabricator.wikimedia.org/T330670) (owner: 10Ssingh)
[18:45:38] <wikibugs>	 (03PS2) 10Ssingh: dns::auth: deprecate role and update site.pp [puppet] - 10https://gerrit.wikimedia.org/r/895894 (https://phabricator.wikimedia.org/T330670)
[18:47:01] <sukhe>	 !log enable puppet on dns4003 to merge 895894
[18:47:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:50:46] <logmsgbot>	 !log mforns@deploy2002 Started deploy [airflow-dags/analytics@3419b7d]: (no justification provided)
[18:50:57] <logmsgbot>	 !log mforns@deploy2002 Finished deploy [airflow-dags/analytics@3419b7d]: (no justification provided) (duration: 00m 10s)
[18:53:52] <sukhe>	 !log enable puppet on A:dns-rec and force puppet run: T330670
[18:53:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:53:57] <stashbot>	 T330670: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670
[19:00:05] <jouncebot>	 jeena and jnuche: That opportune time is upon us again. Time for a MediaWiki train - Utc-7+Utc-0 Version deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230309T1900).
[19:00:30] <icinga-wm>	 RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 18 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:02:43] <wikibugs>	 (03PS1) 10TrainBranchBot: all wikis to 1.40.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896159 (https://phabricator.wikimedia.org/T330204)
[19:02:45] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] all wikis to 1.40.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896159 (https://phabricator.wikimedia.org/T330204) (owner: 10TrainBranchBot)
[19:03:29] <wikibugs>	 (03Merged) 10jenkins-bot: all wikis to 1.40.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896159 (https://phabricator.wikimedia.org/T330204) (owner: 10TrainBranchBot)
[19:04:29] <wikibugs>	 10SRE-swift-storage, 10Commons, 10Wikimedia-production-error: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10Yann) `00028: FAILED: internal_api_error_UploadChunkFileException: [f6b5ef11-ddeb-4e07-ba0a-0207b4d5f33c...
[19:06:23] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.dns.netbox
[19:07:55] <sukhe>	 I have changes in the netbox cookbook for
[19:07:57] <sukhe>	 +ms-fe1013                                1H IN A 10.64.48.149
[19:08:02] <sukhe>	 +ms-fe1013                                1H IN AAAA 2620:0:861:107:10:64:48:149
[19:08:06] <sukhe>	 is it fine to merge those?
[19:09:04] <sukhe>	 cmjohnson1: ^ last you worked on these? sorry if not
[19:10:53] <logmsgbot>	 !log jhuneidi@deploy2002 rebuilt and synchronized wikiversions files: all wikis to 1.40.0-wmf.26  refs T330204
[19:10:57] <stashbot>	 T330204: 1.40.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T330204
[19:12:48] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add dns1003 (renamed from authdns1001) - sukhe@cumin2002"
[19:14:01] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add dns1003 (renamed from authdns1001) - sukhe@cumin2002"
[19:14:01] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:14:52] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 108, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:15:14] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host dns1003
[19:15:40] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dns1003
[19:15:46] <icinga-wm>	 RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 21 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:17:46] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 179, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:17:53] <cmjohnson1>	 sukhe, I was working on them
[19:18:00] <cmjohnson1>	 do I need to start over?
[19:18:12] <sukhe>	 cmjohnson1: sorry, no, merged
[19:18:13] <sukhe>	 all good
[19:18:18] <sukhe>	 no changes pending
[19:18:24] <cmjohnson1>	 okay, thanks!
[19:18:28] <icinga-wm>	 RECOVERY - BFD status on cr2-codfw is OK: OK: UP: 19 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:20:08] <wikibugs>	 (03PS1) 10Cathal Mooney: Homer changes as part of WMCS codfw migration to cloudsw1-b1-eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/896160 (https://phabricator.wikimedia.org/T327919)
[19:20:47] <wikibugs>	 (03PS3) 10Winston Sung: Add DEPRECATED_LANGUAGE_CODE_MAPPING to wgInterlanguageLinkCodeMap [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558052 (https://phabricator.wikimedia.org/T248352) (owner: 10Fomafix)
[19:21:28] <wikibugs>	 (03PS2) 10Jbond: apereo_cas: update idp logout script [puppet] - 10https://gerrit.wikimedia.org/r/896146
[19:22:22] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Homer changes as part of WMCS codfw migration to cloudsw1-b1-eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/896160 (https://phabricator.wikimedia.org/T327919) (owner: 10Cathal Mooney)
[19:23:02] <wikibugs>	 (03Merged) 10jenkins-bot: Homer changes as part of WMCS codfw migration to cloudsw1-b1-eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/896160 (https://phabricator.wikimedia.org/T327919) (owner: 10Cathal Mooney)
[19:35:27] <wikibugs>	 (03CR) 10Bking: [C: 03+2] elastic: Incr per-node shard recovery thru-put cap [puppet] - 10https://gerrit.wikimedia.org/r/895874 (https://phabricator.wikimedia.org/T317816) (owner: 10Ryan Kemper)
[19:39:31] <wikibugs>	 (03CR) 10Nray: "FYI, I'm planning to backport this in 1 hour" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893542 (https://phabricator.wikimedia.org/T326829) (owner: 10Nray)
[19:39:48] <wikibugs>	 10SRE, 10DNS, 10Traffic-Icebox: Consider DNSSec - https://phabricator.wikimedia.org/T26413 (10BCornwall) 05Stalled→03Open Setting to open since no work has begun to warrant a "stalled" status.
[19:41:27] <wikibugs>	 10SRE, 10Acme-chief, 10Traffic-Icebox: Decide/document criteria needed to serve acme-chief LE issued unified certificate to end users - https://phabricator.wikimedia.org/T230687 (10BCornwall) @Vgutierrez It looks like the work you've done means that this can be closed. Is that the case?
[19:46:35] <wikibugs>	 10SRE, 10Traffic-Icebox, 10Patch-For-Review, 10User-jbond: interface-rps.py should have a flag to avoid CPU0 - https://phabricator.wikimedia.org/T236208 (10BCornwall) 05Stalled→03Resolved Seeing as @RLazarus has kindly merged in the functionality as dictated by this ticket, closing as resolved. Any oth...
[19:46:45] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 12:00:00 on an-worker1078.eqiad.wmnet with reason: Replacing RAID BBU
[19:46:58] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 12:00:00 on an-worker1078.eqiad.wmnet with reason: Replacing RAID BBU
[19:47:03] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: RAID BBU for an-worker1078 - https://phabricator.wikimedia.org/T331544 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=d79d8e43-f7d6-4d5b-b758-f7be36ad2914) set by btullis@cumin1001 for 1 day, 12:00:00 on 1 host(s) and their services with...
[19:51:01] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: RAID BBU for an-worker1078 - https://phabricator.wikimedia.org/T331544 (10BTullis) Thanks @Cmjohnson and @Jclark-ctr - I've shut down the machine and given it 36 hours of downtime. Please feel free to boot it whenever the battery is replaced, it should rejoin...
[19:51:23] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster restart to enable incr shard recovery throughput - ryankemper@cumin1001 - T317816
[19:51:29] <stashbot>	 T317816: Enable 10G networking in cirrus elastic clusters - https://phabricator.wikimedia.org/T317816
[19:51:38] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for David Martin - https://phabricator.wikimedia.org/T331500 (10DMartin-WMF) Thanks so much, @MatthewVernon and all!
[19:52:30] <wikibugs>	 10SRE, 10DNS, 10Traffic-Icebox: redirect non-existing wikimania2020.wikimedia.org to wikimania.wikimedia.org - https://phabricator.wikimedia.org/T240341 (10BCornwall) Since Wikimania since 2019 lives under https://wikimania.wikimedia.org/wiki/<year>:Wikimania, can this be closed or is there some desire to co...
[19:56:55] <wikibugs>	 (03PS1) 10Ssingh: hiera: add host override for dns1003 [puppet] - 10https://gerrit.wikimedia.org/r/896169 (https://phabricator.wikimedia.org/T330670)
[19:58:08] <icinga-wm>	 PROBLEM - Check systemd state on arclamp1001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_apache2-htcacheclean.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:02:50] <wikibugs>	 (03PS1) 10Ssingh: sites.yaml: add dns1003 [homer/public] - 10https://gerrit.wikimedia.org/r/896171 (https://phabricator.wikimedia.org/T330670)
[20:03:20] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] hiera: add host override for dns1003 [puppet] - 10https://gerrit.wikimedia.org/r/896169 (https://phabricator.wikimedia.org/T330670) (owner: 10Ssingh)
[20:06:08] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.dns.netbox
[20:07:58] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add dns1003 (renamed from authdns1001) - sukhe@cumin2002"
[20:09:01] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add dns1003 (renamed from authdns1001) - sukhe@cumin2002"
[20:09:02] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:12:19] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.dns.wipe-cache dns1003.wikimedia.org on all recursors
[20:12:22] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) dns1003.wikimedia.org on all recursors
[20:12:30] <wikibugs>	 (03PS1) 10Cathal Mooney: Add uRPF checks for new cloudsw interfaces [homer/public] - 10https://gerrit.wikimedia.org/r/896172 (https://phabricator.wikimedia.org/T327919)
[20:12:56] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host dns1003.wikimedia.org with OS bullseye
[20:13:04] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns1003.wikimedia.org with OS bullseye
[20:13:34] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Add uRPF checks for new cloudsw interfaces [homer/public] - 10https://gerrit.wikimedia.org/r/896172 (https://phabricator.wikimedia.org/T327919) (owner: 10Cathal Mooney)
[20:14:08] <wikibugs>	 (03Merged) 10jenkins-bot: Add uRPF checks for new cloudsw interfaces [homer/public] - 10https://gerrit.wikimedia.org/r/896172 (https://phabricator.wikimedia.org/T327919) (owner: 10Cathal Mooney)
[20:24:37] <topranks>	 !log move cloud-hosts1-b-codfw GW from core routers to cloudsw1-b1-codfw T327919
[20:24:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:24:43] <stashbot>	 T327919: Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919
[20:25:14] <wikibugs>	 10SRE-swift-storage, 10Commons, 10Wikimedia-production-error: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10Yann) Trying the same file on Wikisource: https://en.wikisource.org/wiki/File:Gide_-_The_Vatican_Swindle...
[20:25:20] <wikibugs>	 (03PS1) 10Samtar: InitialiseSettings-labs: Enable Phonos on Beta metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896176 (https://phabricator.wikimedia.org/T331670)
[20:25:43] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dns1003.wikimedia.org with OS bullseye
[20:25:51] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns1003.wikimedia.org with OS bullseye executed with errors...
[20:28:28] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.297 second response time https://wikitech.wikimedia.org/wiki/Swift
[20:30:06] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.141 second response time https://wikitech.wikimedia.org/wiki/Swift
[20:30:58] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dns1003.wikimedia.org']
[20:38:38] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['dns1003.wikimedia.org']
[20:40:19] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: DNM: showcase fixtures for jaeger [deployment-charts] - 10https://gerrit.wikimedia.org/r/896177
[20:42:34] <TheresNoTime>	 jouncebot: nowandnext
[20:42:34] <jouncebot>	 For the next 0 hour(s) and 17 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230309T1900)
[20:42:34] <jouncebot>	 In 0 hour(s) and 17 minute(s): UTC late backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230309T2100)
[20:43:11] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): Q1:rerack elastic10[53-67] - https://phabricator.wikimedia.org/T322082 (10Jclark-ctr) 05Open→03Resolved
[20:44:15] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.dns.netbox
[20:46:05] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add dns2003 (renamed from authdns2001) - sukhe@cumin2002"
[20:46:20] <TheresNoTime>	 doing a beta-only config deploy prior to the backport window
[20:46:38] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896176 (https://phabricator.wikimedia.org/T331670) (owner: 10Samtar)
[20:47:01] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add dns2003 (renamed from authdns2001) - sukhe@cumin2002"
[20:47:01] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:47:23] <wikibugs>	 (03Merged) 10jenkins-bot: InitialiseSettings-labs: Enable Phonos on Beta metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896176 (https://phabricator.wikimedia.org/T331670) (owner: 10Samtar)
[20:53:22] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host dns1003.wikimedia.org with OS bullseye
[20:53:31] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns1003.wikimedia.org with OS bullseye
[20:54:44] <wikibugs>	 (03PS1) 10Cathal Mooney: Remove uRPF filter for interface ae1.2118 on codfw CRs [homer/public] - 10https://gerrit.wikimedia.org/r/896179 (https://phabricator.wikimedia.org/T327919)
[20:59:36] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.dns.wipe-cache dns2003.wikimedia.org on all recursors
[20:59:40] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) dns2003.wikimedia.org on all recursors
[21:00:04] <jouncebot>	 brennen and TheresNoTime: (Dis)respected human, time to deploy UTC late backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230309T2100). Please do the needful.
[21:00:05] <jouncebot>	 James_F and nray: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:11] <James_F>	 Heya.
[21:00:17] <TheresNoTime>	 o/ I can deploy if needed
[21:00:22] <James_F>	 Sure.
[21:00:23] <nray>	 o/
[21:00:31] <James_F>	 Mine are trivial-ish.
[21:00:55] <TheresNoTime>	 will start with them :)
[21:01:07] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895351 (owner: 10Jforrester)
[21:01:55] <wikibugs>	 (03Merged) 10jenkins-bot: Unload RenameUser, now part of core: Part I of II [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895351 (owner: 10Jforrester)
[21:02:05] <logmsgbot>	 !log samtar@deploy2002 Started scap: Backport for [[gerrit:895351|Unload RenameUser, now part of core: Part I of II]]
[21:02:30] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.dns.wipe-cache dns2003.mgmt.codfw.wmnet on all recursors
[21:02:33] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) dns2003.mgmt.codfw.wmnet on all recursors
[21:03:44] <logmsgbot>	 !log samtar@deploy2002 samtar and jforrester: Backport for [[gerrit:895351|Unload RenameUser, now part of core: Part I of II]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet
[21:04:00] <taavi>	 doesn't this week's train still have some extensions referring the Ext\Renameuser classes?
[21:04:16] <James_F>	 taavi: The core code sets the aliases I thought.
[21:04:21] <James_F>	 Hmm.
[21:04:48] <TheresNoTime>	 (waiting, though I have just tested the extension unloaded on en.wiki via the mwdebug and nothing fell over so..)
[21:04:57] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Remove uRPF filter for interface ae1.2118 on codfw CRs [homer/public] - 10https://gerrit.wikimedia.org/r/896179 (https://phabricator.wikimedia.org/T327919) (owner: 10Cathal Mooney)
[21:05:49] <taavi>	 no, the aliases are in https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/Renameuser/+/refs/heads/master/includes/RenameUserSetup.php
[21:05:57] <James_F>	 Bleh.
[21:06:14] <James_F>	 I mean, these are only used on private wikis in practice.
[21:06:15] <TheresNoTime>	 would you like me to rollback?
[21:06:20] <James_F>	 Maybe.
[21:06:35] <James_F>	 But the problem is we have to land the i18n one before the train next week.
[21:06:48] <wikibugs>	 (03Merged) 10jenkins-bot: Remove uRPF filter for interface ae1.2118 on codfw CRs [homer/public] - 10https://gerrit.wikimedia.org/r/896179 (https://phabricator.wikimedia.org/T327919) (owner: 10Cathal Mooney)
[21:07:04] <James_F>	 So maybe better to back-port class changes if they blow up?
[21:07:06] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dns1003.wikimedia.org with OS bullseye
[21:07:14] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns1003.wikimedia.org with OS bullseye executed with errors...
[21:08:13] <TheresNoTime>	 James_F: your call
[21:08:18] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host dns1003
[21:08:29] <James_F>	 TheresNoTime: Let's proceed. I'll fix things if they break.
[21:08:34] <TheresNoTime>	 ack :)
[21:09:36] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dns1003
[21:09:41] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host dns2003
[21:09:45] <taavi>	 I fear that things will break silently as features might be gated behind isLoaded() calls
[21:10:05] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host dns1003.wikimedia.org with OS bullseye
[21:10:14] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns1003.wikimedia.org with OS bullseye
[21:10:50] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dns2003
[21:14:24] <logmsgbot>	 !log samtar@deploy2002 Finished scap: Backport for [[gerrit:895351|Unload RenameUser, now part of core: Part I of II]] (duration: 12m 19s)
[21:14:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:15:22] <TheresNoTime>	 James_F: okay to move on to the next patch?
[21:15:26] <James_F>	 TheresNoTime: Yes
[21:15:40] <TheresNoTime>	 (and fwiw, https://codesearch.wmcloud.org/search/?q=MediaWiki%5C%5CExtension%5C%5CRenameuser&i=nope&files=&excludeFiles=&repos= doesn't *seem* to suggest much is using that namespace..?)
[21:16:04] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895352 (owner: 10Jforrester)
[21:16:48] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[21:16:58] <wikibugs>	 (03Merged) 10jenkins-bot: Unload RenameUser, now part of core: Part II of II [mediawiki-config] - 10https://gerrit.wikimedia.org/r/895352 (owner: 10Jforrester)
[21:17:12] <logmsgbot>	 !log samtar@deploy2002 Started scap: Backport for [[gerrit:895352|Unload RenameUser, now part of core: Part II of II]]
[21:17:35] <James_F>	 TheresNoTime: Yeah, a bunch of things have been fixed in the last few days.
[21:17:55] <James_F>	 taavi: Possibly; I'd have expected it to mostly show up in type errors, which are very noisy in prod.
[21:18:48] <logmsgbot>	 !log samtar@deploy2002 samtar and jforrester: Backport for [[gerrit:895352|Unload RenameUser, now part of core: Part II of II]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet
[21:18:58] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Adjust and remove reverse DNS records after cloudsw1-b1-codfw migration. - cmooney@cumin1001"
[21:19:15] <TheresNoTime>	 going to continue the sync
[21:19:18] <wikibugs>	 (03CR) 10Raymond Ndibe: Modify maintain-dbusers.py to call the rest-api service (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe)
[21:19:27] <James_F>	 TheresNoTime: Thanks!
[21:19:40] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster restart to enable incr shard recovery throughput - ryankemper@cumin1001 - T317816
[21:19:45] <stashbot>	 T317816: Enable 10G networking in cirrus elastic clusters - https://phabricator.wikimedia.org/T317816
[21:19:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:20:01] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Adjust and remove reverse DNS records after cloudsw1-b1-codfw migration. - cmooney@cumin1001"
[21:20:01] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:23:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:24:50] <logmsgbot>	 !log samtar@deploy2002 Finished scap: Backport for [[gerrit:895352|Unload RenameUser, now part of core: Part II of II]] (duration: 07m 38s)
[21:25:52] <TheresNoTime>	 deployed :) nray, you ready?
[21:26:05] <nray>	 yes, thank you! TheresNoTime
[21:26:11] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893542 (https://phabricator.wikimedia.org/T326829) (owner: 10Nray)
[21:26:57] <wikibugs>	 (03Merged) 10jenkins-bot: Replace Cleopatra page with United_States to facilitate synthetic testing of T326829 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893542 (https://phabricator.wikimedia.org/T326829) (owner: 10Nray)
[21:27:09] <logmsgbot>	 !log samtar@deploy2002 Started scap: Backport for [[gerrit:893542|Replace Cleopatra page with United_States to facilitate synthetic testing of T326829 (T326829)]]
[21:27:14] <stashbot>	 T326829: Make languages available to index crawlers in mobile version of article pages - https://phabricator.wikimedia.org/T326829
[21:28:45] <logmsgbot>	 !log samtar@deploy2002 samtar and nray: Backport for [[gerrit:893542|Replace Cleopatra page with United_States to facilitate synthetic testing of T326829 (T326829)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet
[21:28:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:28:59] <TheresNoTime>	 nray: that's live on mwdebug, do you need to test it?
[21:29:12] <nray>	 TheresNoTime: Yes, I'll take a look. Thank you
[21:32:11] <nray>	 TheresNoTime: Looks good! You can proceed
[21:32:18] <TheresNoTime>	 ack :)
[21:35:37] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dns1003.wikimedia.org with OS bullseye
[21:35:45] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns1003.wikimedia.org with OS bullseye executed with errors...
[21:35:49] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host dns1003.wikimedia.org with OS bullseye
[21:35:57] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns1003.wikimedia.org with OS bullseye
[21:37:53] <logmsgbot>	 !log samtar@deploy2002 Finished scap: Backport for [[gerrit:893542|Replace Cleopatra page with United_States to facilitate synthetic testing of T326829 (T326829)]] (duration: 10m 43s)
[21:37:58] <stashbot>	 T326829: Make languages available to index crawlers in mobile version of article pages - https://phabricator.wikimedia.org/T326829
[21:38:01] <TheresNoTime>	 nray: deployed :)
[21:38:09] <nray>	 TheresNoTime: Thank you for your help!
[21:38:55] <TheresNoTime>	 !log close UTC late backport
[21:38:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:46:17] <wikibugs>	 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T328832 (10Papaul) 05Open→03Resolved All those nodes are back up now in codfw we can resolve this task
[21:49:43] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns1003.wikimedia.org with reason: host reimage
[21:51:31] <wikibugs>	 10SRE, 10DNS, 10Traffic-Icebox: redirect non-existing wikimania2020.wikimedia.org to wikimania.wikimedia.org - https://phabricator.wikimedia.org/T240341 (10Dzahn) Hard to tell because every year the organizers of Wikimania are different people. But from experience this does tend to come back every year and m...
[21:52:15] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2002.codfw.wmnet with OS bullseye
[21:52:21] <wikibugs>	 10SRE, 10ops-codfw: codfw:test new Supermicro server - https://phabricator.wikimedia.org/T322578 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host sretest2002.codfw.wmnet with OS bullseye
[21:52:28] <wikibugs>	 (03PS1) 10Ssingh: hiera: add host override for dns2003 [puppet] - 10https://gerrit.wikimedia.org/r/896181 (https://phabricator.wikimedia.org/T330670)
[21:53:06] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns1003.wikimedia.org with reason: host reimage
[21:54:41] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] hiera: add host override for dns2003 [puppet] - 10https://gerrit.wikimedia.org/r/896181 (https://phabricator.wikimedia.org/T330670) (owner: 10Ssingh)
[21:56:22] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host dns2003.wikimedia.org with OS bullseye
[21:56:31] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns2003.wikimedia.org with OS bullseye
[22:01:37] <wikibugs>	 (03PS1) 10Ssingh: sites.yaml: add dns2003 [homer/public] - 10https://gerrit.wikimedia.org/r/896183 (https://phabricator.wikimedia.org/T330670)
[22:02:51] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dns2003.wikimedia.org with OS bullseye
[22:03:00] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns2003.wikimedia.org with OS bullseye executed with errors...
[22:03:05] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host dns2003.wikimedia.org with OS bullseye
[22:03:14] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns2003.wikimedia.org with OS bullseye
[22:14:21] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002"
[22:14:23] <wikibugs>	 (03PS1) 10Ssingh: hiera: add dns[12]003 to ntp_peers [puppet] - 10https://gerrit.wikimedia.org/r/896185 (https://phabricator.wikimedia.org/T330670)
[22:16:15] <wikibugs>	 10SRE, 10DNS, 10Traffic-Icebox: redirect non-existing wikimania2020.wikimedia.org to wikimania.wikimedia.org - https://phabricator.wikimedia.org/T240341 (10BCornwall) It's a bit bizarre to want them since wikimania.wikimedia.org should default to the latest upcoming conference, wouldn't it?
[22:16:31] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns2003.wikimedia.org with reason: host reimage
[22:18:59] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[22:19:50] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns2003.wikimedia.org with reason: host reimage
[22:20:00] <wikibugs>	 10SRE, 10DNS, 10Traffic-Icebox: redirect non-existing wikimania2020.wikimedia.org to wikimania.wikimedia.org - https://phabricator.wikimedia.org/T240341 (10Dzahn) In the past each Wikimania had its own wiki. I think that's where that comes from. They used to be individual wikis. And each Wikimania has a tota...
[22:20:26] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002"
[22:20:27] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns1003.wikimedia.org with OS bullseye
[22:20:34] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns1003.wikimedia.org with OS bullseye completed: - dns1003...
[22:24:17] <wikibugs>	 10SRE, 10DNS, 10Traffic-Icebox: redirect non-existing wikimania2020.wikimedia.org to wikimania.wikimedia.org - https://phabricator.wikimedia.org/T240341 (10BCornwall) But now that there's a single wiki, isn't the idea of having domains with the year on them moot?
[22:24:45] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS records for new links to cloudsw1-b1-codfw - cmooney@cumin1001"
[22:25:49] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS records for new links to cloudsw1-b1-codfw - cmooney@cumin1001"
[22:25:49] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:30:14] <wikibugs>	 (03PS1) 10BryanDavis: Revert "striker: Bump container version to 2023-03-09-005633-production" [puppet] - 10https://gerrit.wikimedia.org/r/896031 (https://phabricator.wikimedia.org/T331674)
[22:33:32] <wikibugs>	 (03CR) 10BryanDavis: [C: 03+1] "PCC output: https://puppet-compiler.wmflabs.org/output/896031/40067/" [puppet] - 10https://gerrit.wikimedia.org/r/896031 (https://phabricator.wikimedia.org/T331674) (owner: 10BryanDavis)
[22:34:54] <wikibugs>	 10SRE, 10DNS, 10Traffic-Icebox: redirect non-existing wikimania2020.wikimedia.org to wikimania.wikimedia.org - https://phabricator.wikimedia.org/T240341 (10Aklapper) Please see the child task T202684#5735025. This task has the status `stalled` as it's blocked on T202684. No need to fragment more discussions...
[22:37:42] <wikibugs>	 (03CR) 10Legoktm: [C: 03+2] Revert "striker: Bump container version to 2023-03-09-005633-production" [puppet] - 10https://gerrit.wikimedia.org/r/896031 (https://phabricator.wikimedia.org/T331674) (owner: 10BryanDavis)
[22:40:58] <bd808>	 !log Forced puppet run on cloudweb100[34] to apply quick fix for T331674
[22:41:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:41:03] <stashbot>	 T331674: Some tool maintainers not showing in Striker UI - https://phabricator.wikimedia.org/T331674
[22:41:30] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] hiera: add dns[12]003 to ntp_peers [puppet] - 10https://gerrit.wikimedia.org/r/896185 (https://phabricator.wikimedia.org/T330670) (owner: 10Ssingh)
[22:41:48] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002"
[22:43:44] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002"
[22:43:45] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns2003.wikimedia.org with OS bullseye
[22:43:56] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns2003.wikimedia.org with OS bullseye completed: - dns2003...
[22:46:39] <wikibugs>	 (03PS1) 10JHathaway: aux: explicitly disable istio injection on namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/896188 (https://phabricator.wikimedia.org/T325178)
[22:46:52] <wikibugs>	 (03Abandoned) 10Ssingh: sites.yaml: add dns2003 [homer/public] - 10https://gerrit.wikimedia.org/r/896183 (https://phabricator.wikimedia.org/T330670) (owner: 10Ssingh)
[22:47:00] <wikibugs>	 (03Abandoned) 10Ssingh: sites.yaml: add dns1003 [homer/public] - 10https://gerrit.wikimedia.org/r/896171 (https://phabricator.wikimedia.org/T330670) (owner: 10Ssingh)
[22:47:14] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2002.codfw.wmnet with OS bullseye
[22:47:18] <wikibugs>	 10SRE, 10ops-codfw: codfw:test new Supermicro server - https://phabricator.wikimedia.org/T322578 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host sretest2002.codfw.wmnet with OS bullseye executed with errors: - sretest2002 (**FAIL**)   - Removed from Puppet and P...
[22:48:50] <wikibugs>	 10SRE, 10DNS, 10Traffic-Icebox: redirect non-existing wikimania2020.wikimedia.org to wikimania.wikimedia.org - https://phabricator.wikimedia.org/T240341 (10BCornwall) @Aklapper, Thanks for linking that. I'm still confused as that seems to be another task entirely: That one is about importing **older** wikis...
[22:49:12] <wikibugs>	 (03PS1) 10Ssingh: sites.yaml: add dns[12]003 to anycast_neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/896190 (https://phabricator.wikimedia.org/T330670)
[22:51:57] <wikibugs>	 (03PS1) 10Ssingh: hiera: add dns[12]003 to authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/896191 (https://phabricator.wikimedia.org/T330670)
[22:53:03] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] sites.yaml: add dns[12]003 to anycast_neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/896190 (https://phabricator.wikimedia.org/T330670) (owner: 10Ssingh)
[22:53:45] <sukhe>	 !log run homer in cr*-{codfw,eqiad} for CR 896190
[22:53:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:55:28] <wikibugs>	 (03CR) 10JHathaway: [C: 03+2] aux: explicitly disable istio injection on namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/896188 (https://phabricator.wikimedia.org/T325178) (owner: 10JHathaway)
[22:58:59] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] hiera: add dns[12]003 to authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/896191 (https://phabricator.wikimedia.org/T330670) (owner: 10Ssingh)
[23:01:04] <sukhe>	 !log pool new dns hosts dns1003 and dns2003: T330670
[23:01:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:01:09] <stashbot>	 T330670: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670
[23:04:42] <logmsgbot>	 !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'.
[23:04:44] <logmsgbot>	 !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'.
[23:09:09] <logmsgbot>	 !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
[23:09:13] <logmsgbot>	 !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[23:27:27] <wikibugs>	 (03PS1) 10BryanDavis: striker: Bump container version to 2023-03-09-185548-production [puppet] - 10https://gerrit.wikimedia.org/r/896194 (https://phabricator.wikimedia.org/T330759)
[23:32:57] <logmsgbot>	 !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@b122672]: import_ttl: replace HdfsSensor with URLSensor
[23:33:11] <logmsgbot>	 !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@b122672]: import_ttl: replace HdfsSensor with URLSensor (duration: 00m 14s)
[23:47:51] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/output/894744/40070/" [puppet] - 10https://gerrit.wikimedia.org/r/894744 (https://phabricator.wikimedia.org/T330091) (owner: 10Dzahn)
[23:52:26] <logmsgbot>	 !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@7b25fbf]: import_ttl: correct date formatting
[23:52:40] <logmsgbot>	 !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@7b25fbf]: import_ttl: correct date formatting (duration: 00m 14s)
[23:57:52] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] "SGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/895878 (https://phabricator.wikimedia.org/T284555) (owner: 10BCornwall)