[00:02:08] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1051218 (owner: TrainBranchBot)
[00:13:15] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[00:14:30] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[00:14:32] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1037.eqiad.wmnet with OS bullseye
[00:14:39] <wikibugs>	 ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#9942772 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1037.eqiad.wmnet with OS bullseye comp...
[00:14:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P65606 and previous config saved to /var/cache/conftool/dbconfig/20240702-001448-marostegui.json
[00:15:08] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[00:16:17] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[00:16:21] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1038.eqiad.wmnet with OS bullseye
[00:16:26] <wikibugs>	 ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#9942773 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1038.eqiad.wmnet with OS bullseye comp...
[00:16:44] <wikibugs>	 ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#9942777 (Jclark-ctr)
[00:16:56] <wikibugs>	 ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#9942778 (Jclark-ctr) a:Jclark-ctr
[00:17:19] <wikibugs>	 ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#9942779 (Jclark-ctr)
[00:18:15] <wikibugs>	 ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#9942781 (Jclark-ctr) @VRiley-WMF if you can update with 2nd network connection then hand over to @cmooney
[00:21:07] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T368866#9942784 (Jclark-ctr) Open→Resolved a:Jclark-ctr duplicate of T362033
[00:23:50] <wikibugs>	 ops-eqiad, SRE, Data-Engineering, DC-Ops: Q4:rack/setup/install an-conf100[4-6] - https://phabricator.wikimedia.org/T364429#9942789 (Jclark-ctr) @BTullis if you get a chance, please update the files. These are ready to be imaged and handed over
[00:27:22] <wikibugs>	 ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870#9942790 (Jclark-ctr) @Andrew @dcaro thank you for providing the update. Do you have host names for this? Please also update preseed.yaml and site.pp
[00:29:56] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P65607 and previous config saved to /var/cache/conftool/dbconfig/20240702-002955-marostegui.json
[00:32:46] <jinxer-wm>	 FIRING: [3x] Traffic bill over quota: Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota got better   - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[00:45:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T364069)', diff saved to https://phabricator.wikimedia.org/P65608 and previous config saved to /var/cache/conftool/dbconfig/20240702-004502-marostegui.json
[00:45:05] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2153.codfw.wmnet with reason: Maintenance
[00:45:05] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[00:45:18] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2153.codfw.wmnet with reason: Maintenance
[00:45:25] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2153 (T364069)', diff saved to https://phabricator.wikimedia.org/P65609 and previous config saved to /var/cache/conftool/dbconfig/20240702-004524-marostegui.json
[00:45:57] <wikibugs>	 (PS3) RLazarus: deployment_server: Add --follow, --attach to mwscript_k8s [puppet] - https://gerrit.wikimedia.org/r/1034633 (https://phabricator.wikimedia.org/T368966)
[00:47:11] <wikibugs>	 (CR) RLazarus: deployment_server: Add --follow, --attach to mwscript_k8s (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1034633 (https://phabricator.wikimedia.org/T368966) (owner: RLazarus)
[00:52:46] <jinxer-wm>	 RESOLVED: [3x] Traffic bill over quota: Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota got better   - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[01:02:48] <wikibugs>	 (PS6) Jdlrobson: [July 2nd] Mobile: Enable dark mode for all users for tier 1 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1050085 (https://phabricator.wikimedia.org/T367151)
[01:08:01] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.43.0-wmf.12 [core] (wmf/1.43.0-wmf.12) - https://gerrit.wikimedia.org/r/1051223 (https://phabricator.wikimedia.org/T366957)
[01:08:03] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.43.0-wmf.12 [core] (wmf/1.43.0-wmf.12) - https://gerrit.wikimedia.org/r/1051223 (https://phabricator.wikimedia.org/T366957) (owner: TrainBranchBot)
[01:18:27] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:20:16] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-wikifunctions (k8s) 1.015s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:23:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:25:16] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-wikifunctions (k8s) 1.015s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:29:59] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/1.43.0-wmf.12 [core] (wmf/1.43.0-wmf.12) - https://gerrit.wikimedia.org/r/1051223 (https://phabricator.wikimedia.org/T366957) (owner: TrainBranchBot)
[01:46:40] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:50:58] <wikibugs>	 SRE: wikipedia-pl-sysop: local images fail to generate thumbnail - https://phabricator.wikimedia.org/T368945#9942841 (Peachey88)
[02:00:05] <jouncebot>	 Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240702T0200)
[02:39:16] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:59:16] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:00:05] <jouncebot>	 Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240702T0300)
[03:01:54] <wikibugs>	 (PS1) TrainBranchBot: testwikis wikis to 1.43.0-wmf.12 [mediawiki-config] - https://gerrit.wikimedia.org/r/1051231 (https://phabricator.wikimedia.org/T366957)
[03:01:56] <wikibugs>	 (CR) TrainBranchBot: [C:+2] testwikis wikis to 1.43.0-wmf.12 [mediawiki-config] - https://gerrit.wikimedia.org/r/1051231 (https://phabricator.wikimedia.org/T366957) (owner: TrainBranchBot)
[03:02:36] <wikibugs>	 (Merged) jenkins-bot: testwikis wikis to 1.43.0-wmf.12 [mediawiki-config] - https://gerrit.wikimedia.org/r/1051231 (https://phabricator.wikimedia.org/T366957) (owner: TrainBranchBot)
[03:03:06] <logmsgbot>	 !log mwpresync@deploy1002 Started scap sync-world: testwikis wikis to 1.43.0-wmf.12  refs T366957
[03:03:09] <stashbot>	 T366957: 1.43.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T366957
[03:06:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:21:22] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T367856)', diff saved to https://phabricator.wikimedia.org/P65610 and previous config saved to /var/cache/conftool/dbconfig/20240702-032121-marostegui.json
[03:21:25] <stashbot>	 T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[03:27:00] <icinga-wm>	 PROBLEM - Hadoop HistoryServer on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process
[03:36:29] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P65611 and previous config saved to /var/cache/conftool/dbconfig/20240702-033628-marostegui.json
[03:39:00] <icinga-wm>	 RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process
[03:48:05] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T364069)', diff saved to https://phabricator.wikimedia.org/P65612 and previous config saved to /var/cache/conftool/dbconfig/20240702-034805-marostegui.json
[03:48:14] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[03:51:36] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P65613 and previous config saved to /var/cache/conftool/dbconfig/20240702-035135-marostegui.json
[03:54:39] <logmsgbot>	 !log mwpresync@deploy1002 Finished scap: testwikis wikis to 1.43.0-wmf.12  refs T366957 (duration: 51m 33s)
[03:54:42] <stashbot>	 T366957: 1.43.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T366957
[04:00:04] <jouncebot>	 Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240702T0400)
[04:01:06] <logmsgbot>	 !log mwpresync@deploy1002 Pruned MediaWiki: 1.43.0-wmf.9 (duration: 01m 02s)
[04:03:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P65614 and previous config saved to /var/cache/conftool/dbconfig/20240702-040312-marostegui.json
[04:06:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T367856)', diff saved to https://phabricator.wikimedia.org/P65615 and previous config saved to /var/cache/conftool/dbconfig/20240702-040643-marostegui.json
[04:06:45] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance
[04:06:46] <stashbot>	 T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[04:06:58] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance
[04:07:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1188 (T367856)', diff saved to https://phabricator.wikimedia.org/P65616 and previous config saved to /var/cache/conftool/dbconfig/20240702-040705-marostegui.json
[04:18:20] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P65617 and previous config saved to /var/cache/conftool/dbconfig/20240702-041819-marostegui.json
[04:33:27] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T364069)', diff saved to https://phabricator.wikimedia.org/P65618 and previous config saved to /var/cache/conftool/dbconfig/20240702-043326-marostegui.json
[04:33:29] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2170.codfw.wmnet with reason: Maintenance
[04:33:32] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[04:33:42] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2170.codfw.wmnet with reason: Maintenance
[04:33:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2170 (T364069)', diff saved to https://phabricator.wikimedia.org/P65619 and previous config saved to /var/cache/conftool/dbconfig/20240702-043349-marostegui.json
[04:47:20] <icinga-wm>	 RECOVERY - Disk space on build2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=build2001&var-datasource=codfw+prometheus/ops
[04:57:38] <icinga-wm>	 PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 3 (deploy1003, ...), Fresh: 141 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[04:58:50] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 33 hosts with reason: Primary switchover s8 T368371
[04:58:53] <stashbot>	 T368371: Switchover s8 master (db1192 -> db1209) - https://phabricator.wikimedia.org/T368371
[04:58:57] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db1209 with weight 0 T368371', diff saved to https://phabricator.wikimedia.org/P65620 and previous config saved to /var/cache/conftool/dbconfig/20240702-045856-marostegui.json
[04:59:18] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 33 hosts with reason: Primary switchover s8 T368371
[04:59:29] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db1209 remove from API T368371', diff saved to https://phabricator.wikimedia.org/P65621 and previous config saved to /var/cache/conftool/dbconfig/20240702-045929-marostegui.json
[04:59:55] <wikibugs>	 (PS2) Gerrit maintenance bot: mariadb: Promote db1209 to s8 master [puppet] - https://gerrit.wikimedia.org/r/1049478 (https://phabricator.wikimedia.org/T368371)
[04:59:57] <wikibugs>	 (PS2) Gerrit maintenance bot: wmnet: Update s8-master alias [dns] - https://gerrit.wikimedia.org/r/1049479 (https://phabricator.wikimedia.org/T368371)
[05:00:36] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Promote db1209 to s8 master [puppet] - https://gerrit.wikimedia.org/r/1049478 (https://phabricator.wikimedia.org/T368371) (owner: Gerrit maintenance bot)
[05:23:55] <marostegui>	 !log Starting s8 eqiad failover from db1192 to db1209 - T368371
[05:23:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:23:59] <stashbot>	 T368371: Switchover s8 master (db1192 -> db1209) - https://phabricator.wikimedia.org/T368371
[05:24:09] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Set s8 eqiad as read-only for maintenance - T368371', diff saved to https://phabricator.wikimedia.org/P65622 and previous config saved to /var/cache/conftool/dbconfig/20240702-052408-marostegui.json
[05:24:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db1209 to s8 primary and set section read-write T368371', diff saved to https://phabricator.wikimedia.org/P65623 and previous config saved to /var/cache/conftool/dbconfig/20240702-052447-marostegui.json
[05:25:44] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1192 T368371', diff saved to https://phabricator.wikimedia.org/P65624 and previous config saved to /var/cache/conftool/dbconfig/20240702-052543-root.json
[05:26:57] <wikibugs>	 (CR) Marostegui: [C:+2] wmnet: Update s8-master alias [dns] - https://gerrit.wikimedia.org/r/1049479 (https://phabricator.wikimedia.org/T368371) (owner: Gerrit maintenance bot)
[05:27:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1192 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P65625 and previous config saved to /var/cache/conftool/dbconfig/20240702-052759-root.json
[05:28:37] <wikibugs>	 SRE, SRE-swift-storage, Data-Persistence, DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e3-eqiad - https://phabricator.wikimedia.org/T365995#9943050 (Marostegui)
[05:43:05] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1192 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P65626 and previous config saved to /var/cache/conftool/dbconfig/20240702-054304-root.json
[05:45:24] <wikibugs>	 (PS1) Giuseppe Lavagetto: Rebuild images to pick up a new version of glogger [docker-images/production-images] - https://gerrit.wikimedia.org/r/1051243 (https://phabricator.wikimedia.org/T368640)
[05:47:19] <wikibugs>	 (PS2) Marostegui: orchestrator: Add volans to powerusers [puppet] - https://gerrit.wikimedia.org/r/1051141
[05:47:19] <wikibugs>	 (PS1) Marostegui: filtered_tables.txt: Remove flaggedpage_pending flaggedtemplates [puppet] - https://gerrit.wikimedia.org/r/1051244 (https://phabricator.wikimedia.org/T368939)
[05:47:45] <wikibugs>	 (CR) Marostegui: [C:+2] orchestrator: Add volans to powerusers [puppet] - https://gerrit.wikimedia.org/r/1051141 (owner: Marostegui)
[05:47:54] <wikibugs>	 (CR) Marostegui: [C:+2] filtered_tables.txt: Remove flaggedpage_pending flaggedtemplates [puppet] - https://gerrit.wikimedia.org/r/1051244 (https://phabricator.wikimedia.org/T368939) (owner: Marostegui)
[05:51:08] <wikibugs>	 (PS1) Marostegui: table_jobs.yaml: Remove flaggedpage_pending and flaggedtemplates [puppet] - https://gerrit.wikimedia.org/r/1051245 (https://phabricator.wikimedia.org/T365568)
[05:58:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1192 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P65627 and previous config saved to /var/cache/conftool/dbconfig/20240702-055809-root.json
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240702T0600)
[06:00:05] <jouncebot>	 marostegui, Amir1, and arnaudb: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240702T0600).
[06:04:21] <jinxer-wm>	 FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:21] <jinxer-wm>	 RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:13:16] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1192 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P65628 and previous config saved to /var/cache/conftool/dbconfig/20240702-061315-root.json
[06:16:08] <icinga-wm>	 PROBLEM - Exim SMTP on lists1004 is CRITICAL: connect to address 208.80.154.81 and port 25: Connection refused https://wikitech.wikimedia.org/wiki/Exim
[06:18:52] <wikibugs>	 (CR) Giuseppe Lavagetto: [V:+2 C:+2] Rebuild images to pick up a new version of glogger [docker-images/production-images] - https://gerrit.wikimedia.org/r/1051243 (https://phabricator.wikimedia.org/T368640) (owner: Giuseppe Lavagetto)
[06:19:12] <icinga-wm>	 RECOVERY - Exim SMTP on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 13 Aug 2024 12:55:14 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Exim
[06:20:50] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] thanos: increase query frontend and store cache sizes [puppet] - https://gerrit.wikimedia.org/r/1051177 (https://phabricator.wikimedia.org/T368953) (owner: Herron)
[06:21:16] <_joe_>	 !log rebuilding httpd-fcgi, mediawiki-httpd images T363342 T368640
[06:21:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:21:20] <stashbot>	 T363342: glogger crashes regularly in mw-on-k8s containers - https://phabricator.wikimedia.org/T363342
[06:21:21] <stashbot>	 T368640: glogger produces invalid JSON when given input with non-printable characters - https://phabricator.wikimedia.org/T368640
[06:21:55] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] prom-ipmi-exporter: add sel-events collector [puppet] - https://gerrit.wikimedia.org/r/1051207 (https://phabricator.wikimedia.org/T368088) (owner: Herron)
[06:24:07] <_joe_>	 jouncebot: now
[06:24:07] <jouncebot>	 For the next 0 hour(s) and 35 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240702T0600)
[06:24:07] <jouncebot>	 For the next 0 hour(s) and 5 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240702T0600)
[06:24:48] <_joe_>	 marostegui: lmk when you're done, I want to do a null deployment with scap to ensure my new image versions don't mess up something
[06:28:21] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1192 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P65629 and previous config saved to /var/cache/conftool/dbconfig/20240702-062820-root.json
[06:31:47] <_joe_>	 ok I guess I can go on
[06:35:04] <logmsgbot>	 !log oblivian@deploy1002 Started scap sync-world: Rebuilding images for change to the base image for httpd
[06:43:27] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1192 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P65630 and previous config saved to /var/cache/conftool/dbconfig/20240702-064326-root.json
[06:47:43] <wikibugs>	 (CR) Ladsgroup: [C:+1] table_jobs.yaml: Remove flaggedpage_pending and flaggedtemplates [puppet] - https://gerrit.wikimedia.org/r/1051245 (https://phabricator.wikimedia.org/T365568) (owner: Marostegui)
[06:56:41] <_joe_>	 jouncebot: next
[06:56:41] <jouncebot>	 In 0 hour(s) and 3 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240702T0700)
[06:57:20] <_joe_>	 oh nothing in the deployment calendar, so I guess it's not a problem if my full rebuild scap lasts a little later
[06:58:32] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1192 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P65631 and previous config saved to /var/cache/conftool/dbconfig/20240702-065831-root.json
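The db1192 repool logged above is staged: the pooled percentage steps through 1, 5, 10, 25, 50, 75 and 100. A minimal Python sketch, using only the timestamps and percentages copied from those `!log` lines, shows the spacing between steps. `REPOOL_STEPS` and `step_intervals` are hypothetical names for illustration; this is not the tooling that performed the repool.

```python
from datetime import datetime

# (timestamp, pooled percentage) pairs copied from the db1192 "(re)pooling" log lines.
REPOOL_STEPS = [
    ("05:27:59", 1),
    ("05:43:05", 5),
    ("05:58:10", 10),
    ("06:13:16", 25),
    ("06:28:21", 50),
    ("06:43:27", 75),
    ("06:58:32", 100),
]


def step_intervals(steps):
    """Return the seconds elapsed between consecutive repool steps."""
    times = [datetime.strptime(ts, "%H:%M:%S") for ts, _ in steps]
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


intervals = step_intervals(REPOOL_STEPS)
for (ts, pct), gap in zip(REPOOL_STEPS[1:], intervals):
    print(f"{ts} -> {pct:3d}% (+{gap}s since previous step)")
print(f"total: {sum(intervals)}s from 1% to 100%")
```

Each step lands roughly 15 minutes after the previous one, so the full ramp from 1% to 100% takes about an hour and a half.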
[06:59:34] <wikibugs>	 (PS1) Kosta Harlan: Revert "QuickSurveys: Add testing survey configuration" [mediawiki-config] - https://gerrit.wikimedia.org/r/1051246 (https://phabricator.wikimedia.org/T368459)
[06:59:44] <kostajh>	 _joe_: I'm about to add something to the calendar
[06:59:54] <XioNoX>	 !log update netboot bookworm image to pickup new point release
[06:59:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:02] <kostajh>	 _joe_: but it is very low priority so could be done later
[07:00:04] <_joe_>	 kostajh: by the time you're done my deployment will be done :)
[07:00:05] <jouncebot>	 Amir1 and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240702T0700).
[07:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:00:21] <_joe_>	 kostajh: no please go on
[07:00:34] <_joe_>	 in 2-3 minutes tops my deployment is done
[07:01:17] <kostajh>	 ack
[07:01:21] <logmsgbot>	 !log oblivian@deploy1002 Finished scap: Rebuilding images for change to the base image for httpd (duration: 26m 52s)
[07:01:26] <_joe_>	 and done :)
[07:02:00] <wikibugs>	 (PS2) Kosta Harlan: Revert "QuickSurveys: Add testing survey configuration" [mediawiki-config] - https://gerrit.wikimedia.org/r/1051246 (https://phabricator.wikimedia.org/T368459)
[07:03:04] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, July 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1051246 (https://phabricator.wikimedia.org/T368459) (owner: Kosta Harlan)
[07:04:07] <kostajh>	 ok, starting deploy
[07:04:53] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1051246 (https://phabricator.wikimedia.org/T368459) (owner: Kosta Harlan)
[07:05:28] <wikibugs>	 (Merged) jenkins-bot: Revert "QuickSurveys: Add testing survey configuration" [mediawiki-config] - https://gerrit.wikimedia.org/r/1051246 (https://phabricator.wikimedia.org/T368459) (owner: Kosta Harlan)
[07:06:10] <logmsgbot>	 !log kharlan@deploy1002 Started scap sync-world: Backport for [[gerrit:1051246|Revert "QuickSurveys: Add testing survey configuration" (T368459)]]
[07:06:13] <stashbot>	 T368459: Test new QuickSurveys features on testwiki - https://phabricator.wikimedia.org/T368459
[07:06:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:14:22] <kostajh>	 scap seems to be stuck on `Started docker pull on k8s nodes` at 99%
[07:16:35] <kostajh>	 restarting the process
[07:16:42] <logmsgbot>	 !log kharlan@deploy1002 Started scap sync-world: Backport for [[gerrit:1051246|Revert "QuickSurveys: Add testing survey configuration" (T368459)]]
[07:16:45] <stashbot>	 T368459: Test new QuickSurveys features on testwiki - https://phabricator.wikimedia.org/T368459
[07:16:54] <_joe_>	 kostajh: it's not stuck...
[07:17:20] <kostajh>	 _joe_: oh. how could you tell? I restarted it already
[07:17:32] <kostajh>	 there was no output on the 99% stage after 5 minutes
[07:18:46] <kostajh>	 it's at `07:17:49 docker_pull_k8s:  99% (in-flight: 2; ok: 428; fail: 0; left: 0)` again, I'll be more patient this time
[07:19:24] <_joe_>	 kostajh: it will time out eventually; it's possible there are some nodes down/unresponsive
[07:19:31] * _joe_ afk
[07:21:23] <kostajh>	 urbanecm / Amir1 can you advise on what I should do if it times out? Do I need to revert the patch and try to sync it, even if the first patch failed to sync?
[07:24:27] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T364069)', diff saved to https://phabricator.wikimedia.org/P65632 and previous config saved to /var/cache/conftool/dbconfig/20240702-072426-marostegui.json
[07:24:29] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[07:32:47] <urbanecm>	 kostajh: during the pull k8s stage, nothing, as long as it is not affecting like half of the nodes.
[07:33:24] <urbanecm>	 iirc, scap will not even complain about the timeout in a hard way, it'll just continue
[07:34:05] <urbanecm>	 source: https://wm-bot.wmcloud.org/browser/index.php?start=06%2F17%2F2024&end=06%2F17%2F2024&display=%23wikimedia-operations (2024-06-17 13:35:34)
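The `docker_pull_k8s` progress line kostajh quoted can be read as a set of per-node counters. A minimal Python sketch follows, assuming (the log does not confirm this) that `in-flight`, `ok`, `fail` and `left` together account for every targeted node. `PROGRESS_RE` and `parse_progress` are hypothetical helpers written for this illustration and are not part of scap.

```python
import re

# Pattern for the progress line as it appears in the chat; the field meanings are assumed.
PROGRESS_RE = re.compile(
    r"(?P<stage>\w+):\s+(?P<pct>\d+)% "
    r"\(in-flight: (?P<in_flight>\d+); ok: (?P<ok>\d+); "
    r"fail: (?P<fail>\d+); left: (?P<left>\d+)\)"
)


def parse_progress(line):
    """Parse a progress line into a dict of counters, or return None if it does not match."""
    m = PROGRESS_RE.search(line)
    if m is None:
        return None
    fields = {k: int(v) for k, v in m.groupdict().items() if k != "stage"}
    fields["stage"] = m.group("stage")
    return fields


line = "07:17:49 docker_pull_k8s:  99% (in-flight: 2; ok: 428; fail: 0; left: 0)"
p = parse_progress(line)
# Assumption: the four counters partition the set of targeted nodes.
total = p["in_flight"] + p["ok"] + p["fail"] + p["left"]
print(f"{p['stage']}: {p['ok']} of {total} nodes done, {p['in_flight']} still pulling")
```

Under that assumption, 99% with `left: 0` means every node has been started and only the two in-flight pulls remain, which fits the advice above to wait for the timeout rather than restart.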
[07:37:44] <jinxer-wm>	 FIRING: KubernetesDeploymentUnavailableReplicas: ...
[07:37:44] <jinxer-wm>	 Deployment mw-web.eqiad.main in mw-web at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=mw-web&var-deployment=mw-web.eqiad.main - ...
[07:37:44] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[07:39:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P65633 and previous config saved to /var/cache/conftool/dbconfig/20240702-073933-marostegui.json
[07:40:56] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] table_jobs.yaml: Remove flaggedpage_pending and flaggedtemplates [puppet] - 10https://gerrit.wikimedia.org/r/1051245 (https://phabricator.wikimedia.org/T365568) (owner: 10Marostegui)
[07:42:22] <wikibugs>	 (03CR) 10JMeybohm: ""minor" 😄 - thanks" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051133 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm)
[07:43:38] <icinga-wm>	 RECOVERY - Disk space on an-web1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-web1001&var-datasource=eqiad+prometheus/ops
[07:47:16] <wikibugs>	 (03PS1) 10Filippo Giunchedi: librenms: use ec certificates only [puppet] - 10https://gerrit.wikimedia.org/r/1051249 (https://phabricator.wikimedia.org/T369008)
[07:51:54] <logmsgbot>	 !log kharlan@deploy1002 kharlan: Backport for [[gerrit:1051246|Revert "QuickSurveys: Add testing survey configuration" (T368459)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:51:57] <stashbot>	 T368459: Test new QuickSurveys features on testwiki - https://phabricator.wikimedia.org/T368459
[07:52:49] <logmsgbot>	 !log kharlan@deploy1002 kharlan: Continuing with sync
[07:54:41] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P65634 and previous config saved to /var/cache/conftool/dbconfig/20240702-075440-marostegui.json
[07:57:44] <jinxer-wm>	 FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment mw-api-int.eqiad.main in mw-api-int at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[07:58:14] <wikibugs>	 (03CR) 10Fabfur: "with the base64rawurl decoder we can avoid hex" [puppet] - 10https://gerrit.wikimedia.org/r/1051198 (https://phabricator.wikimedia.org/T365718) (owner: 10Fabfur)
[07:58:28] <logmsgbot>	 !log kharlan@deploy1002 Finished scap: Backport for [[gerrit:1051246|Revert "QuickSurveys: Add testing survey configuration" (T368459)]] (duration: 41m 45s)
[07:58:30] <stashbot>	 T368459: Test new QuickSurveys features on testwiki - https://phabricator.wikimedia.org/T368459
[07:59:05] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T367856)', diff saved to https://phabricator.wikimedia.org/P65635 and previous config saved to /var/cache/conftool/dbconfig/20240702-075904-marostegui.json
[07:59:07] <stashbot>	 T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[08:00:04] <jouncebot>	 hashar and jeena: Deploy window MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240702T0800)
[08:00:29] <MatmaRex>	 urbanecm: hi, you around? could you check how's the maintenance script from yesterday doing?
[08:00:33] <urbanecm>	 sure
[08:00:46] <urbanecm>	 MatmaRex: it is completed
[08:00:50] <MatmaRex>	 nice
[08:00:57] <urbanecm>	 MatmaRex: do you want the log?
[08:01:06] <jayme>	 !log cordon kubernetes1051.eqiad.wmnet because of several failed image pulls
[08:01:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:29] <MatmaRex>	 urbanecm: yeah, if it isn't a big chore, can you drop it on the task? thank you
[08:02:44] <jinxer-wm>	 FIRING: [4x] KubernetesDeploymentUnavailableReplicas: Deployment mw-api-ext.eqiad.main in mw-api-ext at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[08:02:56] <jayme>	 hashar | jeena: can you hold for a minute please? I'd like to double check the backport deploy because ^ 
[08:03:15] <urbanecm>	 MatmaRex: no problem, published at https://phabricator.wikimedia.org/T356196#9943331
[08:03:25] <MatmaRex>	 thanks
[08:03:25] <kostajh>	 backport failed
[08:03:39] <kostajh>	 `backport failed: <CalledProcessError> Command '['/usr/bin/scap', 'sync-world', '--pause-after-testserver-sync', '--notify-user=kharlan', 'Backport for [[gerrit:1051246|Revert "QuickSurveys: Add testing survey configuration" (T368459)]]']' returned non-zero exit status`
[08:03:40] <stashbot>	 T368459: Test new QuickSurveys features on testwiki - https://phabricator.wikimedia.org/T368459
[08:03:49] <jayme>	 kostajh: what did it say ...ah
[08:04:05] <jayme>	 very informative :D
[08:04:21] <urbanecm>	 kostajh: there should be a more detailed error message somewhere up
[08:04:27] <kostajh>	 looking
[08:04:34] <kostajh>	 at 07:50 there was `07:50:05 1 K8s nodes failed to pull the multiversion image`
[08:04:47] <kostajh>	 followed by `07:50:05 Finished docker pull on k8s nodes (duration: 32m 40s)`
[08:05:21] <kostajh>	 and at 7:50:05 there was also `07:50:05 /usr/bin/sudo /usr/local/sbin/mediawiki-image-download 2024-07-02-071649-publish (ran as mwdeploy@kubernetes1051.eqiad.wmnet) returned [143]: Pulling 'docker-registry.discovery.wmnet/restricted/mediawiki-multiversion:2024-07-02-071649-publish'...` which ended with `Terminated`
[08:05:22] <jayme>	 there is one node not behaving properly (kubernetes1051.eqiad.wmnet) ..and not failing properly
[08:05:51] <kostajh>	 I guess that is `ran as mwdeploy@kubernetes1051.eqiad.wmnet`
[08:06:12] <jayme>	 yes
[08:06:50] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_magru
[08:07:04] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_magru
[08:07:19] <jayme>	 !log draining kubernetes1051.eqiad.wmnet
[08:07:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:22] <kostajh>	 what (if anything) should I do? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1051246 is now merged but not deployed.
[08:08:15] <jayme>	 kostajh: I'm not 100% sure as we're now spilling into the train window...cc hashar/jeena
[08:08:38] <wikibugs>	 (03PS1) 10KartikMistry: Update MinT to 2024-07-02-060114-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051290 (https://phabricator.wikimedia.org/T364525)
[08:09:04] <jayme>	 I've taken out kubernetes1051 so if that was the problem (which I suspect) retrying should work in a minute
[08:09:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T364069)', diff saved to https://phabricator.wikimedia.org/P65637 and previous config saved to /var/cache/conftool/dbconfig/20240702-080948-marostegui.json
[08:09:51] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2173.codfw.wmnet with reason: Maintenance
[08:09:51] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[08:10:01] <jayme>	 ah, AIUI train is not going to happen because https://phabricator.wikimedia.org/T366957
[08:10:04] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2173.codfw.wmnet with reason: Maintenance
[08:10:05] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance
[08:10:19] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance
[08:10:26] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2173 (T364069)', diff saved to https://phabricator.wikimedia.org/P65638 and previous config saved to /var/cache/conftool/dbconfig/20240702-081025-marostegui.json
[08:10:45] <kostajh>	 jayme: ok please let me know when I should retry
[08:11:03] <jayme>	 kostajh: sure, give me 5'
[08:11:08] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp2027.*} and A:cp
[08:11:23] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] admin_ng: Switch eqiad und codfw wikikube clusters to PSS [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051133 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm)
[08:12:44] <jinxer-wm>	 RESOLVED: [4x] KubernetesDeploymentUnavailableReplicas: Deployment mw-api-ext.eqiad.main in mw-api-ext at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[08:12:59] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp2027.*} and A:cp
[08:13:36] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp2028.*} and A:cp
[08:14:12] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P65639 and previous config saved to /var/cache/conftool/dbconfig/20240702-081411-marostegui.json
[08:14:33] <wikibugs>	 (03Merged) 10jenkins-bot: admin_ng: Switch eqiad und codfw wikikube clusters to PSS [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051133 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm)
[08:14:51] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp2028.*} and A:cp
[08:15:30] <wikibugs>	 (03PS1) 10TrainBranchBot: group0 wikis to 1.43.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051291 (https://phabricator.wikimedia.org/T366957)
[08:15:31] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] group0 wikis to 1.43.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051291 (https://phabricator.wikimedia.org/T366957) (owner: 10TrainBranchBot)
[08:15:33] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[08:15:48] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp2030.*} and A:cp
[08:15:50] <kostajh>	 I guess train deployment is happening?
[08:16:08] <wikibugs>	 (03Merged) 10jenkins-bot: group0 wikis to 1.43.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051291 (https://phabricator.wikimedia.org/T366957) (owner: 10TrainBranchBot)
[08:16:19] <jayme>	 kostajh: looks like it :/
[08:16:36] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[08:17:03] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp2030.*} and A:cp
[08:17:37] <jayme>	 question is who is running it :)
[08:19:59] <wikibugs>	 (03PS1) 10Fabfur: hiera: upgrade haproxy to 2.8 on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1051292 (https://phabricator.wikimedia.org/T367756)
[08:20:27] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp2031.*} and A:cp
[08:20:28] <jayme>	 kostajh: I'd say we wait...at least I'm not sure what's supposed to happen rn. Are you okay with re-trying in the afternoon window? Or will the train roll out your change anyways?
[08:20:40] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1051292 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur)
[08:22:16] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp2031.*} and A:cp
[08:23:12] <hashar>	 docker_pull_k8s:  99% (in-flight: 1; ok: 429; fail: 0; left: 0)
[08:23:24] <hashar>	 looks like one of the k8s workers is a little slow for some reason ;)
[08:23:39] <hashar>	 kostajh: jayme: yes I have started the train, sorry I forgot to check here :/
[08:24:02] <jayme>	 hashar: we tried to reach out...one of the nodes is borked and will not pull the image
[08:24:09] <hashar>	 ah ok
[08:24:29] <hashar>	 my guess is the docker pull made by scap does not have a timeout
[08:24:32] <jayme>	 but it will also not run mw as it's cordoned now. I'm unsure how scap will handle that though
[08:26:34] <hashar>	 so we gotta remove it from the dsh group
[08:28:12] <hashar>	 08:27:53 docker_pull_k8s: 100% (in-flight: 0; ok: 430; fail: 0; left: 0)        
[08:28:19] <hashar>	 08:27:53 docker_pull_k8s: 100% (in-flight: 0; ok: 430; fail: 0; left: 0)        
[08:28:24] <hashar>	 somehow it managed to pass
[08:29:09] <jayme>	 good. I've removed all workload from the problematic node, maybe that helped
[08:29:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P65640 and previous config saved to /var/cache/conftool/dbconfig/20240702-082918-marostegui.json
[08:30:03] <hashar>	 jayme: I apologize I should have checked on this channel before starting
[08:30:05] <jayme>	 I'll set it to inactive anyways. AIUI that should nowadays prevent scap from trying to pull the image there
[08:30:07] <wikibugs>	 (03PS1) 10Slyngshede: LDAP key sync: Improvements to SSH key sync with LDAP. [software/bitu] - 10https://gerrit.wikimedia.org/r/1051293 (https://phabricator.wikimedia.org/T366525)
[08:30:07] <kostajh>	 jayme: I can sync my patch later
[08:30:15] <hashar>	 I am running the train over a Google Meet with Arnaud this morning, and did not look at IRC :/
[08:30:31] <kostajh>	 I just don't know if it's problematic to have a patch in mediawiki-config merged that is not actually deployed
[08:30:40] <logmsgbot>	 !log jayme@cumin1002 conftool action : set/pooled=inactive; selector: name=kubernetes1051.eqiad.wmnet
[08:30:52] <hashar>	 I have no idea 
[08:31:06] <hashar>	 my guess is that if it is not pulled on the deployment server it is not included in the image
[08:32:34] <kostajh>	 hashar: should I try to sync it again now?
[08:33:05] <hashar>	 the train is going on 
[08:33:14] <hashar>	 08:33:09 K8s deployment progress:  67% (ok: 1455; fail: 0; left: 697) |         
[08:34:13] <hashar>	 kostajh: so your patch got merged, I ran `scap train` which sends a patch to mediawiki-config to switch the versions
[08:34:15] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_magru
[08:34:21] <hashar>	 which comes AFTER your patch
[08:34:30] <hashar>	 and thus I am currently deploying your config change
[08:34:37] <hashar>	 (as well as switching the group0 wikis)
[08:34:45] <logmsgbot>	 !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.43.0-wmf.12  refs T366957
[08:34:48] <stashbot>	 T366957: 1.43.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T366957
[08:35:06] <hashar>	 kostajh: should be good now
[08:35:13] <hashar>	 sorry for the screw up :-\
[08:35:55] <wikibugs>	 06SRE, 06serviceops, 10Data Products (Data Products Sprint 15), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9943442 (10SGupta-WMF) Thank you @scott_french for detailed explanation , I am...
[08:36:13] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_magru
[08:38:11] <kostajh>	 hashar: ah ok, so I don't need to do anything else?
[08:38:19] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp6009.*} and A:cp
[08:38:39] <wikibugs>	 (03CR) 10Elukey: [C:03+2] profile::puppetserver::git: add an option to exclude servers [puppet] - 10https://gerrit.wikimedia.org/r/1050601 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey)
[08:39:05] <hashar>	 kostajh: nop! I have sneakily deployed it!
[08:40:50] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp6009.*} and A:cp
[08:43:30] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2018:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2018:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:44:26] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T367856)', diff saved to https://phabricator.wikimedia.org/P65641 and previous config saved to /var/cache/conftool/dbconfig/20240702-084425-marostegui.json
[08:44:27] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance
[08:44:28] <stashbot>	 T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[08:44:41] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance
[08:44:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1197 (T367856)', diff saved to https://phabricator.wikimedia.org/P65642 and previous config saved to /var/cache/conftool/dbconfig/20240702-084447-marostegui.json
[08:45:20] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:45:37] <kostajh>	 hashar: excellent :) thx
[08:45:52] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:46:28] <wikibugs>	 (03CR) 10Urbanecm: [C:04-1] "FWIW, this is not the requirement. It is perfectly fine to add settings to WMF config that are not yet in extension.json, especially when " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047961 (https://phabricator.wikimedia.org/T365877) (owner: 10Sergio Gimeno)
[08:47:12] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52340 bytes in 1.868 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:47:30] <wikibugs>	 (03CR) 10Elukey: Homer: fix Netbox 4 breaking changes (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/1050377 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi)
[08:47:41] <wikibugs>	 (03CR) 10Vgutierrez: "ok... don't forget to update the commit message" [puppet] - 10https://gerrit.wikimedia.org/r/1051198 (https://phabricator.wikimedia.org/T365718) (owner: 10Fabfur)
[08:47:42] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:48:25] <wikibugs>	 (03CR) 10Elukey: [C:03+1] "Didn't check all the details but if the code is tested and works, LGTM!" [software/homer] - 10https://gerrit.wikimedia.org/r/1050377 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi)
[08:48:30] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service wdqs2018:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2018:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:48:34] <wikibugs>	 (03CR) 10Vgutierrez: benthos:cache: encode problematic fields as hex (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1051198 (https://phabricator.wikimedia.org/T365718) (owner: 10Fabfur)
[08:50:14] <wikibugs>	 (03CR) 10Elukey: [C:03+1] "LGTM, the only nit that would be great is to add comments where we use [0] to indicate why. In the future all reviewers will be happy to a" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1050379 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi)
[08:53:07] <wikibugs>	 (03CR) 10Elukey: "Just to double check - all of this works with python3-pynetbox 6.6 right? Or do we need to test it somewhere?" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1050453 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi)
[08:54:38] <wikibugs>	 06SRE, 06serviceops, 10Data Products (Data Products Sprint 15), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9943487 (10mforns) > The service is up and running in staging, and can be reac...
[08:54:47] <wikibugs>	 (03PS2) 10Slyngshede: LDAP key sync: Improvements to SSH key sync with LDAP. [software/bitu] - 10https://gerrit.wikimedia.org/r/1051293 (https://phabricator.wikimedia.org/T366525)
[08:57:34] <logmsgbot>	 !log jynus@cumin1002 dbctl commit (dc=all): 'Repool es1025 at 10% weight T363812', diff saved to https://phabricator.wikimedia.org/P65643 and previous config saved to /var/cache/conftool/dbconfig/20240702-085733-jynus.json
[08:57:37] <stashbot>	 T363812: Setup backups for es6, es7 and archive old read only backups - https://phabricator.wikimedia.org/T363812
[08:59:10] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[39-41] - https://phabricator.wikimedia.org/T363341#9943498 (10cmooney) 05Resolved→03Open
[09:00:32] <wikibugs>	 (03CR) 10Vgutierrez: [C:04-1] "this sounds a lot like https://community.letsencrypt.org/t/apache-chain-issues-with-dual-rsa-ecdsa-certificates/153960. Please use `SSLCer" [puppet] - 10https://gerrit.wikimedia.org/r/1051249 (https://phabricator.wikimedia.org/T369008) (owner: 10Filippo Giunchedi)
[09:02:44] <wikibugs>	 (03CR) 10Elukey: "Quick question to better understand the code - it would be nice to avoid using the [0] selector throughout the code, since we know that we" [cookbooks] - 10https://gerrit.wikimedia.org/r/1050445 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi)
[09:03:20] <wikibugs>	 (03CR) 10Vgutierrez: [C:04-1] "BTW the behavior that you're describing is well-known and documented on the Apache httpd documentation in https://httpd.apache.org/docs/cur" [puppet] - 10https://gerrit.wikimedia.org/r/1051249 (https://phabricator.wikimedia.org/T369008) (owner: 10Filippo Giunchedi)
[09:10:24] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[39-41] - https://phabricator.wikimedia.org/T363341#9943509 (10cmooney) >>! In T363341#9936269, @Jclark-ctr wrote: > cloudcephosd1039 > 2nd cable serial#20220008 port 1 > cloudcephosd1040 > 2nd cable serial#...
[09:15:09] <logmsgbot>	 !log jynus@cumin1002 dbctl commit (dc=all): 'Repool es1025 at 50% weight T363812', diff saved to https://phabricator.wikimedia.org/P65644 and previous config saved to /var/cache/conftool/dbconfig/20240702-091508-jynus.json
[09:15:12] <stashbot>	 T363812: Setup backups for es6, es7 and archive old read only backups - https://phabricator.wikimedia.org/T363812
[09:15:14] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[39-41] - https://phabricator.wikimedia.org/T363341#9943513 (10cmooney) 05Open→03Resolved
[09:17:36] <wikibugs>	 (03CR) 10Volans: "I did a quick pass Arnold. The general approach looks good, nothing major. I've left few minor suggestions inline." [cookbooks] - 10https://gerrit.wikimedia.org/r/1042179 (https://phabricator.wikimedia.org/T366078) (owner: 10AOkoth)
[09:20:13] <logmsgbot>	 !log brouberol@cumin1002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:dse-k8s-worker
[09:22:03] <wikibugs>	 (03PS1) 10Volans: Images: take advantage of performance optimization [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1051298
[09:22:03] <wikibugs>	 (03PS1) 10Volans: Images: automatically migrate to a new OS [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1051299
[09:23:06] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Add per-output queue monitoring for Juniper network devices - https://phabricator.wikimedia.org/T326322#9943538 (10cmooney) So the change to the timeout has made a big difference, but there are still some small gaps:  {F56165130}  {F5616524...
[09:23:43] <wikibugs>	 (03CR) 10Jelto: [C:03+2] gitlab-settings: v1.6.0 for squash commit templates [puppet] - 10https://gerrit.wikimedia.org/r/1051178 (https://phabricator.wikimedia.org/T366624) (owner: 10Brennen Bearnes)
[09:24:44] <wikibugs>	 (03PS1) 10Cathal Mooney: Increase scrape_timeout for gnmic prometheus to 30s [puppet] - 10https://gerrit.wikimedia.org/r/1051300 (https://phabricator.wikimedia.org/T326322)
[09:26:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:26:47] <wikibugs>	 (03PS2) 10Filippo Giunchedi: librenms: serve chained LE certs [puppet] - 10https://gerrit.wikimedia.org/r/1051249 (https://phabricator.wikimedia.org/T369008)
[09:28:45] <wikibugs>	 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 10Puppet-Infrastructure, and 2 others: Migrate puppet merges to a cookbook - https://phabricator.wikimedia.org/T366355#9943566 (10elukey)
[09:29:21] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] librenms: serve chained LE certs [puppet] - 10https://gerrit.wikimedia.org/r/1051249 (https://phabricator.wikimedia.org/T369008) (owner: 10Filippo Giunchedi)
[09:29:49] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] librenms: serve chained LE certs [puppet] - 10https://gerrit.wikimedia.org/r/1051249 (https://phabricator.wikimedia.org/T369008) (owner: 10Filippo Giunchedi)
[09:33:02] <wikibugs>	 06SRE, 06serviceops, 10Data Products (Data Products Sprint 15), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9943578 (10Sfaci) Great explanation @Scott_French!. I didn't know that. We'll...
[09:34:21] <wikibugs>	 (03PS1) 10Filippo Giunchedi: o11y: serve LE chained certificates [puppet] - 10https://gerrit.wikimedia.org/r/1051301 (https://phabricator.wikimedia.org/T369014)
[09:34:41] <wikibugs>	 (03CR) 10Volans: "Nice! Couple of suggestions inline, but I agree with the approach." [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1049966 (https://phabricator.wikimedia.org/T368744) (owner: 10Elukey)
[09:36:28] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] o11y: serve LE chained certificates [puppet] - 10https://gerrit.wikimedia.org/r/1051301 (https://phabricator.wikimedia.org/T369014) (owner: 10Filippo Giunchedi)
[09:38:09] <wikibugs>	 (03PS1) 10Vgutierrez: gerrit: Stop using SSLCertificateChainFile [puppet] - 10https://gerrit.wikimedia.org/r/1051304 (https://phabricator.wikimedia.org/T369014)
[09:39:47] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] o11y: serve LE chained certificates [puppet] - 10https://gerrit.wikimedia.org/r/1051301 (https://phabricator.wikimedia.org/T369014) (owner: 10Filippo Giunchedi)
[09:40:40] <wikibugs>	 (03PS1) 10Vgutierrez: mirrors: Stop using SSLCertificateChainFile [puppet] - 10https://gerrit.wikimedia.org/r/1051307 (https://phabricator.wikimedia.org/T369014)
[09:41:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:41:28] <wikibugs>	 (03CR) 10Hashar: [C:03+1] gerrit: Stop using SSLCertificateChainFile [puppet] - 10https://gerrit.wikimedia.org/r/1051304 (https://phabricator.wikimedia.org/T369014) (owner: 10Vgutierrez)
[09:45:51] <wikibugs>	 (03PS1) 10Vgutierrez: orchestrator: Stop using SSLCertificateChainFile [puppet] - 10https://gerrit.wikimedia.org/r/1051310 (https://phabricator.wikimedia.org/T369014)
[09:46:33] <vgutierrez>	 marostegui: ^^ could you take care of getting that CR reviewed from somebody in your team?
[09:46:41] <vgutierrez>	 s/from/by/
[09:47:02] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] Increase scrape_timeout for gnmic prometheus to 30s [puppet] - 10https://gerrit.wikimedia.org/r/1051300 (https://phabricator.wikimedia.org/T326322) (owner: 10Cathal Mooney)
[09:47:45] <wikibugs>	 (03CR) 10EoghanGaffney: [C:03+1] "Looks good!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1051107 (https://phabricator.wikimedia.org/T367501) (owner: 10Jelto)
[09:47:53] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on kubemaster[2001-2002].codfw.wmnet with reason: decom
[09:48:05] <wikibugs>	 (03PS1) 10Jelto: gitlab: use ensure latest when cloning the gitlab-exporter repo [puppet] - 10https://gerrit.wikimedia.org/r/1051312 (https://phabricator.wikimedia.org/T354656)
[09:48:08] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on kubemaster[2001-2002].codfw.wmnet with reason: decom
[09:48:33] <wikibugs>	 (03CR) 10Hashar: [C:03+1] "I don't know how sensible this change is since I don't know much about certificates." [puppet] - 10https://gerrit.wikimedia.org/r/1051304 (https://phabricator.wikimedia.org/T369014) (owner: 10Vgutierrez)
[09:50:13] <marostegui>	 vgutierrez: checking
[09:50:27] <vgutierrez>	 marostegui: thx <3
[09:50:51] <wikibugs>	 (03CR) 10Marostegui: [C:03+1] orchestrator: Stop using SSLCertificateChainFile [puppet] - 10https://gerrit.wikimedia.org/r/1051310 (https://phabricator.wikimedia.org/T369014) (owner: 10Vgutierrez)
[09:52:21] <elukey>	 !log volatile dir on puppetserver1001 with the new point release (12.6) for Bookworm
[09:52:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:39] <wikibugs>	 (03CR) 10Marostegui: [C:03+1] "Once merged let me know, so I can double check that everything works as expected" [puppet] - 10https://gerrit.wikimedia.org/r/1051310 (https://phabricator.wikimedia.org/T369014) (owner: 10Vgutierrez)
[09:53:53] <logmsgbot>	 !log jiji@cumin1002 conftool action : set/pooled=no; selector: name=kubemaster200[1-2].codfw.wmnet
[09:56:44] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] services: update thumbor-plugin Docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051150 (owner: 10Elukey)
[09:58:55] <wikibugs>	 (03PS2) 10Volans: Images: automatically migrate to a new OS [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1051299 (https://phabricator.wikimedia.org/T368744)
[10:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240702T1000)
[10:00:39] <wikibugs>	 (03CR) 10EoghanGaffney: [C:03+1] gitlab: use ensure latest when cloning the gitlab-exporter repo [puppet] - 10https://gerrit.wikimedia.org/r/1051312 (https://phabricator.wikimedia.org/T354656) (owner: 10Jelto)
[10:01:40] <icinga-wm>	 RECOVERY - BFD status on cr1-drmrs is OK: UP: 0 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:02:06] <claime>	 !log homer 'cr*codfw*' commit 'T351074'
[10:02:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:11] <stashbot>	 T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074
[10:03:41] <wikibugs>	 (CR) Slyngshede: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1051307 (https://phabricator.wikimedia.org/T369014) (owner: Vgutierrez)
[10:04:13] <wikibugs>	 (CR) Vgutierrez: "the change at the moment should be a NOOP for gerrit. But if we don't deploy it as soon as acme-chief renews gerrit certificate (it should" [puppet] - https://gerrit.wikimedia.org/r/1051304 (https://phabricator.wikimedia.org/T369014) (owner: Vgutierrez)
[10:06:36] <logmsgbot>	 !log jynus@cumin1002 dbctl commit (dc=all): 'Repool es1025 at 100% weight T363812', diff saved to https://phabricator.wikimedia.org/P65645 and previous config saved to /var/cache/conftool/dbconfig/20240702-100636-jynus.json
[10:06:39] <stashbot>	 T363812: Setup backups for es6, es7 and archive old read only backups - https://phabricator.wikimedia.org/T363812
[10:18:34] <wikibugs>	 (PS1) Elukey: profile::puppetmaster::updatenetboot: update help and install targets [puppet] - https://gerrit.wikimedia.org/r/1051317
[10:21:02] <wikibugs>	 (PS2) Elukey: profile::puppetmaster::updatenetboot: update help and install targets [puppet] - https://gerrit.wikimedia.org/r/1051317
[10:21:13] <wikibugs>	 (CR) Vgutierrez: [C:+1] hiera: upgrade haproxy to 2.8 on eqiad [puppet] - https://gerrit.wikimedia.org/r/1051292 (https://phabricator.wikimedia.org/T367756) (owner: Fabfur)
[10:21:18] <wikibugs>	 (CR) Cathal Mooney: [C:+2] Increase scrape_timeout for gnmic prometheus to 30s [puppet] - https://gerrit.wikimedia.org/r/1051300 (https://phabricator.wikimedia.org/T326322) (owner: Cathal Mooney)
[10:21:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:22:35] <wikibugs>	 (CR) Elukey: [V:+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1051317 (owner: Elukey)
[10:23:03] <wikibugs>	 (CR) Elukey: profile::puppetmaster::updatenetboot: update help and install targets [puppet] - https://gerrit.wikimedia.org/r/1051317 (owner: Elukey)
[10:25:40] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-master1003.eqiad.wmnet
[10:26:17] <wikibugs>	 (CR) Fabfur: [C:+2] hiera: upgrade haproxy to 2.8 on eqiad [puppet] - https://gerrit.wikimedia.org/r/1051292 (https://phabricator.wikimedia.org/T367756) (owner: Fabfur)
[10:27:35] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_eqiad
[10:27:54] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_eqiad
[10:28:30] <fabfur>	 !log upgrading A:cp-eqiad to haproxy 2.8.10 (T367756)
[10:28:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:33] <stashbot>	 T367756: Upgrade hosts to haproxy 2.8.10 - https://phabricator.wikimedia.org/T367756
[10:32:55] <logmsgbot>	 !log brouberol@cumin1002 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:dse-k8s-worker
[10:34:45] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-master1003.eqiad.wmnet
[10:35:48] <logmsgbot>	 !log brouberol@cumin1002 START - Cookbook sre.druid.roll-restart-workers for Druid public cluster: Roll restart of Druid jvm daemons.
[10:36:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:38:50] <wikibugs>	 (CR) Clément Goubert: [C:+1] profile::puppetmaster::updatenetboot: update help and install targets [puppet] - https://gerrit.wikimedia.org/r/1051317 (owner: Elukey)
[10:39:41] <wikibugs>	 (PS1) Effie Mouzeli: kubernetes: retire kubemaster200[1-2] [puppet] - https://gerrit.wikimedia.org/r/1051321 (https://phabricator.wikimedia.org/T353464)
[10:41:46] <wikibugs>	 (PS3) Volans: data.yaml: Add daphnesmit to deployment [puppet] - https://gerrit.wikimedia.org/r/1049391 (https://phabricator.wikimedia.org/T368159) (owner: Slyngshede)
[10:41:51] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[10:42:01] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[10:42:24] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[10:42:43] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[10:42:51] <wikibugs>	 (CR) Volans: [C:+2] "Rebased resolving conflicts. Approved on task. Merging." [puppet] - https://gerrit.wikimedia.org/r/1049391 (https://phabricator.wikimedia.org/T368159) (owner: Slyngshede)
[10:43:53] <wikibugs>	 (PS3) Elukey: Allow to save new OS names without them being present on the DB [software/debmonitor] - https://gerrit.wikimedia.org/r/1049966 (https://phabricator.wikimedia.org/T368744)
[10:44:49] <wikibugs>	 (PS2) Effie Mouzeli: kubernetes: retire kubemaster200[1-2] in codfw [puppet] - https://gerrit.wikimedia.org/r/1051321 (https://phabricator.wikimedia.org/T353464)
[10:45:22] <wikibugs>	 (CR) Elukey: "Thanks a lot for the review!" [software/debmonitor] - https://gerrit.wikimedia.org/r/1049966 (https://phabricator.wikimedia.org/T368744) (owner: Elukey)
[10:46:01] <wikibugs>	 (CR) Clément Goubert: [C:+1] kubernetes: retire kubemaster200[1-2] in codfw [puppet] - https://gerrit.wikimedia.org/r/1051321 (https://phabricator.wikimedia.org/T353464) (owner: Effie Mouzeli)
[10:46:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2165 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P65646 and previous config saved to /var/cache/conftool/dbconfig/20240702-104605-root.json
[10:47:32] <wikibugs>	 (PS1) Effie Mouzeli: kubernetes: retire kubemaster100[1-2] in eqiad [puppet] - https://gerrit.wikimedia.org/r/1051323 (https://phabricator.wikimedia.org/T353464)
[10:48:06] <wikibugs>	 SRE, cloud-services-team, Data-Services: [wikireplicas] Make sure there is no sensitive data in clouddb hosts - https://phabricator.wikimedia.org/T368136#9943798 (Marostegui) Just one addition: sanitarium hosts also have replication filters to exclude tables or entire databases (private wikis).
[10:48:10] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment for daphnesmit/Daphne Smit/DSmit-WMF - https://phabricator.wikimedia.org/T368159#9943785 (Volans) Open→Resolved The above patch has been merged. Within 30 minutes it will be effective. Resolving the task. Feel fre...
[10:48:23] <wikibugs>	 (CR) Effie Mouzeli: [C:+2] kubernetes: retire kubemaster200[1-2] in codfw [puppet] - https://gerrit.wikimedia.org/r/1051321 (https://phabricator.wikimedia.org/T353464) (owner: Effie Mouzeli)
[10:49:42] <wikibugs>	 (CR) Volans: [C:+1] "LGTM!" [software/debmonitor] - https://gerrit.wikimedia.org/r/1049966 (https://phabricator.wikimedia.org/T368744) (owner: Elukey)
[10:50:29] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.decommission for hosts kubemaster[2001-2002].codfw.wmnet
[10:54:09] <wikibugs>	 SRE, serviceops, Data Products (Data Products Sprint 15), Patch-For-Review, Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9943809 (Clement_Goubert) >>! In T361835#9943486, @mforns wrote: >> The serv...
[10:56:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:56:51] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.dns.netbox
[10:59:03] <wikibugs>	 (CR) Ayounsi: "yeah exactly. The end goal is to have all Netbox API calls in a spicerack module, and avoid direct calls from cookbooks. For example with " [cookbooks] - https://gerrit.wikimedia.org/r/1050445 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi)
[11:00:14] <wikibugs>	 (CR) Ayounsi: "I haven't tested it, but I tested similar changes in Homer and that works on Pynetbox 6.6" [software/spicerack] - https://gerrit.wikimedia.org/r/1050453 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi)
[11:00:47] <wikibugs>	 (CR) Elukey: Allow to save new OS names without them being present on the DB (2 comments) [software/debmonitor] - https://gerrit.wikimedia.org/r/1049966 (https://phabricator.wikimedia.org/T368744) (owner: Elukey)
[11:00:54] <wikibugs>	 (CR) Elukey: Allow to save new OS names without them being present on the DB (1 comment) [software/debmonitor] - https://gerrit.wikimedia.org/r/1049966 (https://phabricator.wikimedia.org/T368744) (owner: Elukey)
[11:01:11] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2165 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P65647 and previous config saved to /var/cache/conftool/dbconfig/20240702-110111-root.json
[11:03:55] <wikibugs>	 (PS5) Jcrespo: dbbackups: Set sql_mode as loose so that invalid enum values are ok [puppet] - https://gerrit.wikimedia.org/r/1048424 (https://phabricator.wikimedia.org/T367162)
[11:03:55] <wikibugs>	 (PS1) Jcrespo: backup: Reduce the maximum amount of volumes for es-rw pools [puppet] - https://gerrit.wikimedia.org/r/1051324
[11:04:10] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote db1173 to s6 master [puppet] - https://gerrit.wikimedia.org/r/1051325 (https://phabricator.wikimedia.org/T369020)
[11:04:15] <wikibugs>	 (PS1) Gerrit maintenance bot: wmnet: Update s6-master alias [dns] - https://gerrit.wikimedia.org/r/1051326 (https://phabricator.wikimedia.org/T369020)
[11:04:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T364069)', diff saved to https://phabricator.wikimedia.org/P65648 and previous config saved to /var/cache/conftool/dbconfig/20240702-110442-marostegui.json
[11:04:46] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[11:06:43] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote db2214 to s6 master [puppet] - https://gerrit.wikimedia.org/r/1051327 (https://phabricator.wikimedia.org/T369021)
[11:07:35] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 27 hosts with reason: Primary switchover s6 T369021
[11:07:38] <stashbot>	 T369021: Switchover s6 master (db2129 -> db2214) - https://phabricator.wikimedia.org/T369021
[11:07:41] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kubemaster[2001-2002].codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1002"
[11:07:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db2214 with weight 0 T369021', diff saved to https://phabricator.wikimedia.org/P65649 and previous config saved to /var/cache/conftool/dbconfig/20240702-110750-root.json
[11:07:58] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s6 T369021
[11:08:37] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Promote db2214 to s6 master [puppet] - https://gerrit.wikimedia.org/r/1051327 (https://phabricator.wikimedia.org/T369021) (owner: Gerrit maintenance bot)
[11:10:28] <wikibugs>	 (CR) Jcrespo: [C:+2] dbbackups: Set sql_mode as loose so that invalid enum values are ok [puppet] - https://gerrit.wikimedia.org/r/1048424 (https://phabricator.wikimedia.org/T367162) (owner: Jcrespo)
[11:10:28] <wikibugs>	 (PS1) Clément Goubert: kubernetes: move 5 appservers to kubernetes [puppet] - https://gerrit.wikimedia.org/r/1051328 (https://phabricator.wikimedia.org/T351074)
[11:10:53] <wikibugs>	 (CR) Jcrespo: [C:+2] backup: Reduce the maximum amount of volumes for es-rw pools [puppet] - https://gerrit.wikimedia.org/r/1051324 (owner: Jcrespo)
[11:11:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:11:25] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kubemaster[2001-2002].codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1002"
[11:11:25] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:11:26] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts kubemaster[2001-2002].codfw.wmnet
[11:11:46] <wikibugs>	 (PS1) Elukey: cloud: add default for profile::puppetserver::git::exclude_servers [puppet] - https://gerrit.wikimedia.org/r/1051329
[11:12:18] <wikibugs>	 (CR) David Caro: [C:+1] "thanks!" [puppet] - https://gerrit.wikimedia.org/r/1051329 (owner: Elukey)
[11:12:24] <wikibugs>	 (CR) Elukey: [C:+2] cloud: add default for profile::puppetserver::git::exclude_servers [puppet] - https://gerrit.wikimedia.org/r/1051329 (owner: Elukey)
[11:12:29] <claime>	 !log pooling and uncordoning wikikube-worker2025.codfw.wmnet|wikikube-worker2026.codfw.wmnet|wikikube-worker2027.codfw.wmnet|wikikube-worker2028.codfw.wmnet|wikikube-worker2029.codfw.wmnet - T351074
[11:12:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:12:31] <stashbot>	 T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074
[11:12:38] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_eqiad
[11:12:39] <logmsgbot>	 !log cgoubert@cumin1002 conftool action : set/weight=10:pooled=yes; selector: name=(wikikube-worker2025.codfw.wmnet|wikikube-worker2026.codfw.wmnet|wikikube-worker2027.codfw.wmnet|wikikube-worker2028.codfw.wmnet|wikikube-worker2029.codfw.wmnet),cluster=kubernetes,service=kubesvc
[11:14:25] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_eqiad
[11:14:54] <wikibugs>	 (CR) Marostegui: "Sorry I missed this!" [puppet] - https://gerrit.wikimedia.org/r/1048424 (https://phabricator.wikimedia.org/T367162) (owner: Jcrespo)
[11:16:17] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2165 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P65650 and previous config saved to /var/cache/conftool/dbconfig/20240702-111616-root.json
[11:16:27] <wikibugs>	 (PS2) Sergio Gimeno: GrowthExperiments: add community updates module flag [mediawiki-config] - https://gerrit.wikimedia.org/r/1047961 (https://phabricator.wikimedia.org/T365877)
[11:16:46] <wikibugs>	 (PS1) Jcrespo: backup: Increase the maximum amount of volumes for es-readonly [puppet] - https://gerrit.wikimedia.org/r/1051330
[11:17:06] <wikibugs>	 (CR) Jcrespo: [C:-1] backup: Increase the maximum amount of volumes for es-readonly [puppet] - https://gerrit.wikimedia.org/r/1051330 (owner: Jcrespo)
[11:17:41] <wikibugs>	 (PS2) Jcrespo: backup: Increase the maximum amount of volumes for es-readonly [puppet] - https://gerrit.wikimedia.org/r/1051330
[11:17:53] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1008.eqiad.wmnet with OS bullseye
[11:17:53] <wikibugs>	 (PS3) Jcrespo: backup: Increase the maximum amount of volumes for es-readonly [puppet] - https://gerrit.wikimedia.org/r/1051330
[11:19:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P65651 and previous config saved to /var/cache/conftool/dbconfig/20240702-111949-marostegui.json
[11:20:45] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[11:21:05] <claime>	 !log Uncordoning wikikube-ctrl2001.codfw.wmnet and wikikube-ctrl2002.codfw.wmnet
[11:21:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:21:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:21:50] <wikibugs>	 (CR) Sergio Gimeno: "Right, ty. I was not sure if going with a default of true or false at the time of writing this patch. Derived from the fact of deciding if" [mediawiki-config] - https://gerrit.wikimedia.org/r/1047961 (https://phabricator.wikimedia.org/T365877) (owner: Sergio Gimeno)
[11:21:54] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[11:21:57] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reboot-single for host kubernetes1051.eqiad.wmnet
[11:22:02] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_eqiad
[11:22:05] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_eqiad
[11:22:29] <wikibugs>	 (PS8) David Caro: envvars-backend: update endpoint to new schema [puppet] - https://gerrit.wikimedia.org/r/1050567 (https://phabricator.wikimedia.org/T368516) (owner: Slavina Stefanova)
[11:22:29] <wikibugs>	 (PS1) David Caro: ci: enable failing when hiera missing from cloud [puppet] - https://gerrit.wikimedia.org/r/1051332
[11:22:32] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[11:22:42] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host eventlog1003.eqiad.wmnet
[11:23:10] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[11:24:42] <jayme>	 !log switched wikikube production clusters from PSP to PSS for restricted namespaces - T273507
[11:24:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:45] <stashbot>	 T273507: PodSecurityPolicies will be deprecated with Kubernetes 1.21 - https://phabricator.wikimedia.org/T273507
[11:24:56] <marostegui>	 !log Starting s6 codfw failover from db2129 to db2214 - T369021
[11:24:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:59] <stashbot>	 T369021: Switchover s6 master (db2129 -> db2214) - https://phabricator.wikimedia.org/T369021
[11:25:19] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2214 to s6 primary T369021', diff saved to https://phabricator.wikimedia.org/P65652 and previous config saved to /var/cache/conftool/dbconfig/20240702-112518-marostegui.json
[11:26:04] <wikibugs>	 (CR) CI reject: [V:-1] envvars-backend: update endpoint to new schema [puppet] - https://gerrit.wikimedia.org/r/1050567 (https://phabricator.wikimedia.org/T368516) (owner: Slavina Stefanova)
[11:26:17] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2129 T369021', diff saved to https://phabricator.wikimedia.org/P65653 and previous config saved to /var/cache/conftool/dbconfig/20240702-112616-root.json
[11:26:25] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:26:27] <icinga-wm>	 PROBLEM - BGP status on lsw1-e3-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:26:28] <wikibugs>	 (CR) David Caro: "this was removed from voting in I41fe8738c4d15beecb70753ed7dd76fcea85405a" [puppet] - https://gerrit.wikimedia.org/r/1051332 (owner: David Caro)
[11:26:36] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host eventlog1003.eqiad.wmnet
[11:27:13] <logmsgbot>	 !log brouberol@cumin1002 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid public cluster: Roll restart of Druid jvm daemons.
[11:27:58] <wikibugs>	 (PS9) David Caro: envvars-backend: update endpoint to new schema [puppet] - https://gerrit.wikimedia.org/r/1050567 (https://phabricator.wikimedia.org/T368516) (owner: Slavina Stefanova)
[11:31:22] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2165 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P65654 and previous config saved to /var/cache/conftool/dbconfig/20240702-113122-root.json
[11:31:25] <wikibugs>	 (CR) CI reject: [V:-1] envvars-backend: update endpoint to new schema [puppet] - https://gerrit.wikimedia.org/r/1050567 (https://phabricator.wikimedia.org/T368516) (owner: Slavina Stefanova)
[11:34:05] <wikibugs>	 (CR) Jcrespo: [C:+2] backup: Increase the maximum amount of volumes for es-readonly [puppet] - https://gerrit.wikimedia.org/r/1051330 (owner: Jcrespo)
[11:34:43] <wikibugs>	 (PS1) Marostegui: db2114: No longer a candidate master [puppet] - https://gerrit.wikimedia.org/r/1051333 (https://phabricator.wikimedia.org/T362948)
[11:34:57] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P65655 and previous config saved to /var/cache/conftool/dbconfig/20240702-113457-marostegui.json
[11:35:11] <wikibugs>	 (CR) Marostegui: [C:+2] db2114: No longer a candidate master [puppet] - https://gerrit.wikimedia.org/r/1051333 (https://phabricator.wikimedia.org/T362948) (owner: Marostegui)
[11:36:02] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Long schema change
[11:36:05] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Long schema change
[11:37:33] <logmsgbot>	 !log brouberol@cumin1002 START - Cookbook sre.druid.roll-restart-workers for Druid analytics cluster: Roll restart of Druid jvm daemons.
[11:37:55] <wikibugs>	 (CR) Jelto: [C:+2] gitlab: use ensure latest when cloning the gitlab-exporter repo [puppet] - https://gerrit.wikimedia.org/r/1051312 (https://phabricator.wikimedia.org/T354656) (owner: Jelto)
[11:40:44] <wikibugs>	 (CR) Alexandros Kosiaris: [C:+1] kubernetes: move 5 appservers to kubernetes [puppet] - https://gerrit.wikimedia.org/r/1051328 (https://phabricator.wikimedia.org/T351074) (owner: Clément Goubert)
[11:41:12] <wikibugs>	 (CR) Clément Goubert: [C:+2] kubernetes: move 5 appservers to kubernetes [puppet] - https://gerrit.wikimedia.org/r/1051328 (https://phabricator.wikimedia.org/T351074) (owner: Clément Goubert)
[11:41:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:42:32] <icinga-wm>	 PROBLEM - Host kubernetes1051 is DOWN: PING CRITICAL - Packet loss = 100%
[11:42:33] <jinxer-wm>	 FIRING: KubernetesCalicoDown: kubernetes1051.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=kubernetes1051.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[11:43:05] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw2307 to wikikube-worker2030
[11:43:11] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox
[11:43:42] <wikibugs>	 (PS1) Jelto: gitlab: use dedicated ensure for gitlab-exporter [puppet] - https://gerrit.wikimedia.org/r/1051335 (https://phabricator.wikimedia.org/T354656)
[11:44:35] <logmsgbot>	 !log root@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1008.eqiad.wmnet with OS bullseye
[11:46:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:46:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2165 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P65656 and previous config saved to /var/cache/conftool/dbconfig/20240702-114627-root.json
[11:46:31] <wikibugs>	 (CR) Jforrester: [C:+2] Reference widget: check for undefined config [extensions/WikibaseMediaInfo] (wmf/1.43.0-wmf.11) - https://gerrit.wikimedia.org/r/1051202 (https://phabricator.wikimedia.org/T368736) (owner: Jforrester)
[11:48:26] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1008.eqiad.wmnet with OS bullseye
[11:50:05] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T364069)', diff saved to https://phabricator.wikimedia.org/P65657 and previous config saved to /var/cache/conftool/dbconfig/20240702-115003-marostegui.json
[11:50:07] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2174.codfw.wmnet with reason: Maintenance
[11:50:07] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[11:50:20] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2174.codfw.wmnet with reason: Maintenance
[11:50:25] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2307 to wikikube-worker2030 - cgoubert@cumin1002"
[11:50:27] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2174 (T364069)', diff saved to https://phabricator.wikimedia.org/P65658 and previous config saved to /var/cache/conftool/dbconfig/20240702-115026-marostegui.json
[11:52:53] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2307 to wikikube-worker2030 - cgoubert@cumin1002"
[11:52:54] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:52:54] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2030
[11:53:29] <wikibugs>	 (PS1) Jcrespo: dbbackups: Reenable es-readonly for one-time es5 section backup [puppet] - https://gerrit.wikimedia.org/r/1051341 (https://phabricator.wikimedia.org/T363812)
[11:54:03] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2030
[11:54:12] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2307 to wikikube-worker2030
[11:55:03] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw2309 to wikikube-worker2031
[11:55:09] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox
[11:57:03] <wikibugs>	 (CR) CI reject: [V:-1] dbbackups: Reenable es-readonly for one-time es5 section backup [puppet] - https://gerrit.wikimedia.org/r/1051341 (https://phabricator.wikimedia.org/T363812) (owner: Jcrespo)
[11:57:29] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2309 to wikikube-worker2031 - cgoubert@cumin1002"
[11:58:16] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[11:58:29] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[11:58:41] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2309 to wikikube-worker2031 - cgoubert@cumin1002"
[11:58:41] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:58:41] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2031
[11:58:53] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2031
[11:58:54] <wikibugs>	 (CR) EoghanGaffney: [C:+1] gitlab: use dedicated ensure for gitlab-exporter [puppet] - https://gerrit.wikimedia.org/r/1051335 (https://phabricator.wikimedia.org/T354656) (owner: Jelto)
[11:59:01] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2309 to wikikube-worker2031
[11:59:07] <wikibugs>	 (PS1) Ayounsi: Preseed: set /32 netmask for virtual ranges [puppet] - https://gerrit.wikimedia.org/r/1051342 (https://phabricator.wikimedia.org/T300152)
[11:59:24] <wikibugs>	 (CR) Jcrespo: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1051341 (https://phabricator.wikimedia.org/T363812) (owner: Jcrespo)
[11:59:25] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw2365 to wikikube-worker2032
[11:59:32] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox
[11:59:37] <wikibugs>	 (PS1) Jforrester: Drop bare-metal servers from Wikimedia Debug tool config [mediawiki-config] - https://gerrit.wikimedia.org/r/1051343 (https://phabricator.wikimedia.org/T367949)
[11:59:41] <wikibugs>	 (PS1) Jforrester: mwdebug: Change various uses to mw-on-k8s version [puppet] - https://gerrit.wikimedia.org/r/1051344
[11:59:42] <wikibugs>	 (PS1) Jforrester: mwdebug: Drop mwdebug\d{4} for bare metal servers [puppet] - https://gerrit.wikimedia.org/r/1051345 (https://phabricator.wikimedia.org/T367949)
[11:59:48] <wikibugs>	 (CR) Jelto: [C:+2] gitlab: use dedicated ensure for gitlab-exporter [puppet] - https://gerrit.wikimedia.org/r/1051335 (https://phabricator.wikimedia.org/T354656) (owner: Jelto)
[12:00:01] <jynus>	 I got a CI error on profile::gitlab, any recent change there?
[12:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240702T1200)
[12:00:24] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[12:00:25] <jelto>	 jynus: fix is merging. should be fixed in a sec / after rebase
[12:00:25] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[12:00:32] <jynus>	 jelto: no worries then
[12:00:47] <wikibugs>	 (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1051342 (https://phabricator.wikimedia.org/T300152) (owner: Ayounsi)
[12:00:49] <jynus>	 I was just confused because my patch was so trivial!
[12:00:59] <vgutierrez>	 marostegui: merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1051310 now
[12:01:07] <wikibugs>	 (CR) Vgutierrez: [C:+2] orchestrator: Stop using SSLCertificateChainFile [puppet] - https://gerrit.wikimedia.org/r/1051310 (https://phabricator.wikimedia.org/T369014) (owner: Vgutierrez)
[12:01:29] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[12:01:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2165 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P65659 and previous config saved to /var/cache/conftool/dbconfig/20240702-120133-root.json
[12:01:34] <marostegui>	 Ok vgutierrez
[12:01:39] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[12:01:59] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2365 to wikikube-worker2032 - cgoubert@cumin1002"
[12:02:25] <jelto>	 puppet/CI/pcc for profile::gitlab should be happy again
[12:02:30] <wikibugs>	 (CR) CI reject: [V:-1] Preseed: set /32 netmask for virtual ranges [puppet] - https://gerrit.wikimedia.org/r/1051342 (https://phabricator.wikimedia.org/T300152) (owner: Ayounsi)
[12:03:15] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2365 to wikikube-worker2032 - cgoubert@cumin1002"
[12:03:15] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:03:15] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2032
[12:03:17] <wikibugs>	 (CR) Vgutierrez: [C:+2] mirrors: Stop using SSLCertificateChainFile [puppet] - https://gerrit.wikimedia.org/r/1051307 (https://phabricator.wikimedia.org/T369014) (owner: Vgutierrez)
[12:04:21] <wikibugs>	 (CR) CI reject: [V:-1] mwdebug: Change various uses to mw-on-k8s version [puppet] - https://gerrit.wikimedia.org/r/1051344 (owner: Jforrester)
[12:04:30] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2032
[12:04:39] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2365 to wikikube-worker2032
[12:05:06] <wikibugs>	 (PS2) Jforrester: [wikifunctions] Grant wikifunctions-staff enum and converter rights [mediawiki-config] - https://gerrit.wikimedia.org/r/1048059 (https://phabricator.wikimedia.org/T366610)
[12:05:08] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1008.eqiad.wmnet with reason: host reimage
[12:05:22] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, July 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1048059 (https://phabricator.wikimedia.org/T366610) (owner: Jforrester)
[12:05:32] <wikibugs>	 (CR) Jcrespo: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1051341 (https://phabricator.wikimedia.org/T363812) (owner: Jcrespo)
[12:05:33] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw2392 to wikikube-worker2033
[12:05:40] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox
[12:05:50] <wikibugs>	 (CR) CI reject: [V:-1] mwdebug: Drop mwdebug\d{4} for bare metal servers [puppet] - https://gerrit.wikimedia.org/r/1051345 (https://phabricator.wikimedia.org/T367949) (owner: Jforrester)
[12:07:25] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_eqiad
[12:07:41] <logmsgbot>	 !log root@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1008.eqiad.wmnet with reason: host reimage
[12:08:08] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2392 to wikikube-worker2033 - cgoubert@cumin1002"
[12:08:50] <wikibugs>	 (CR) Marostegui: [C:+1] dbbackups: Reenable es-readonly for one-time es5 section backup [puppet] - https://gerrit.wikimedia.org/r/1051341 (https://phabricator.wikimedia.org/T363812) (owner: Jcrespo)
[12:09:13] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_eqiad
[12:09:25] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2392 to wikikube-worker2033 - cgoubert@cumin1002"
[12:09:25] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:09:25] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2033
[12:09:27] <wikibugs>	 (PS10) David Caro: envvars-backend: update endpoint to new schema [puppet] - https://gerrit.wikimedia.org/r/1050567 (https://phabricator.wikimedia.org/T368516) (owner: Slavina Stefanova)
[12:09:36] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2033
[12:09:45] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2392 to wikikube-worker2033
[12:10:49] <wikibugs>	 (CR) Jcrespo: [C:+2] dbbackups: Reenable es-readonly for one-time es5 section backup [puppet] - https://gerrit.wikimedia.org/r/1051341 (https://phabricator.wikimedia.org/T363812) (owner: Jcrespo)
[12:11:12] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw2393 to wikikube-worker2034
[12:11:17] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox
[12:12:22] <wikibugs>	 (PS2) Ayounsi: Preseed: set /32 netmask for virtual ranges [puppet] - https://gerrit.wikimedia.org/r/1051342 (https://phabricator.wikimedia.org/T300152)
[12:12:47] <wikibugs>	 (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1051342 (https://phabricator.wikimedia.org/T300152) (owner: Ayounsi)
[12:14:13] <wikibugs>	 (Merged) jenkins-bot: Reference widget: check for undefined config [extensions/WikibaseMediaInfo] (wmf/1.43.0-wmf.11) - https://gerrit.wikimedia.org/r/1051202 (https://phabricator.wikimedia.org/T368736) (owner: Jforrester)
[12:14:54] <logmsgbot>	 !log jforrester@deploy1002 Started scap sync-world: Backport for [[gerrit:1051202|Reference widget: check for undefined config (T368736)]]
[12:14:57] <stashbot>	 T368736: Structured Data add reference not working - https://phabricator.wikimedia.org/T368736
[12:15:29] <wikibugs>	 (CR) David Caro: "Ready for reviews, passes the tests and passes in toolsbeta:" [puppet] - https://gerrit.wikimedia.org/r/1050567 (https://phabricator.wikimedia.org/T368516) (owner: Slavina Stefanova)
[12:15:39] <wikibugs>	 SRE, Data-Engineering, Dumps-Generation, Data Products (Data Products Sprint 15), and 2 others: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#9944101 (ABran-WMF) [[ https://wm-bot.wmflabs.org/libera_logs/%23wikimedia-d...
[12:15:51] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2393 to wikikube-worker2034 - cgoubert@cumin1002"
[12:15:54] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[12:16:01] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[12:16:15] <logmsgbot>	 !log eoghan@cumin1002 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on lists1001.wikimedia.org with reason: Pre-decommissioning lists1001
[12:16:18] <logmsgbot>	 !log eoghan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on lists1001.wikimedia.org with reason: Pre-decommissioning lists1001
[12:16:34] <wikibugs>	 SRE, collaboration-services, Wikimedia-Mailing-lists, Patch-For-Review: Migrate Mailman/lists to Bullseye/Bookworm - https://phabricator.wikimedia.org/T331706#9944103 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=410ac7b2-3327-4734-8665-8ceb56bdc810) set by eoghan@cumin1002 fo...
[12:16:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2165 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P65660 and previous config saved to /var/cache/conftool/dbconfig/20240702-121638-root.json
[12:17:13] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2393 to wikikube-worker2034 - cgoubert@cumin1002"
[12:17:13] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:17:13] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2034
[12:17:24] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2034
[12:17:31] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2393 to wikikube-worker2034
[12:17:52] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2030.codfw.wmnet with OS bullseye
[12:17:54] <wikibugs>	 (CR) David Caro: [C:+2] envvars-backend: update endpoint to new schema [puppet] - https://gerrit.wikimedia.org/r/1050567 (https://phabricator.wikimedia.org/T368516) (owner: Slavina Stefanova)
[12:18:14] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2031.codfw.wmnet with OS bullseye
[12:18:34] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2032.codfw.wmnet with OS bullseye
[12:18:50] <wikibugs>	 (CR) Ayounsi: "D-I bug fixed and deployed in bookworm-installer - https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1064005" [puppet] - https://gerrit.wikimedia.org/r/1051342 (https://phabricator.wikimedia.org/T300152) (owner: Ayounsi)
[12:18:54] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2033.codfw.wmnet with OS bullseye
[12:19:12] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2034.codfw.wmnet with OS bullseye
[12:19:17] <logmsgbot>	 !log jforrester@deploy1002 jforrester: Backport for [[gerrit:1051202|Reference widget: check for undefined config (T368736)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[12:19:41] <logmsgbot>	 !log jforrester@deploy1002 jforrester: Continuing with sync
[12:19:49] <wikibugs>	 (CR) David Caro: [C:+2] "Passing in tools too:" [puppet] - https://gerrit.wikimedia.org/r/1050567 (https://phabricator.wikimedia.org/T368516) (owner: Slavina Stefanova)
[12:19:56] <wikibugs>	 (PS1) Fabfur: hiera: upgrade haproxy to 2.8 on esams [puppet] - https://gerrit.wikimedia.org/r/1051348 (https://phabricator.wikimedia.org/T367756)
[12:20:50] <icinga-wm>	 PROBLEM - Host mw2307 is DOWN: PING CRITICAL - Packet loss = 100%
[12:21:20] <wikibugs>	 SRE, collaboration-services, Wikimedia-Mailing-lists, Patch-For-Review: Migrate Mailman/lists to Bullseye/Bookworm - https://phabricator.wikimedia.org/T331706#9944135 (eoghan) lists1001 has been powered off, it will stay off for 1 week and then I'll decommission it fully on Tuesday, 9th July, aft...
[12:22:14] <wikibugs>	 SRE, collaboration-services, Wikimedia-Mailing-lists, Datacenter-Switchover: Make mailman3 work in the standby host (lists2001.wikimedia.org) - https://phabricator.wikimedia.org/T283615#9944140 (eoghan) In progress→Resolved I think we can close this, since the puppet module now instal...
[12:22:57] <wikibugs>	 (CR) Brouberol: [C:+1] "PCC looks good" [puppet] - https://gerrit.wikimedia.org/r/1051342 (https://phabricator.wikimedia.org/T300152) (owner: Ayounsi)
[12:24:53] <logmsgbot>	 !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:1051202|Reference widget: check for undefined config (T368736)]] (duration: 09m 59s)
[12:24:56] <stashbot>	 T368736: Structured Data add reference not working - https://phabricator.wikimedia.org/T368736
[12:25:02] <logmsgbot>	 !log brouberol@cumin1002 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid analytics cluster: Roll restart of Druid jvm daemons.
[12:25:45] <marostegui>	 !log Deploy schema change on db2129 s6 codfw dbmaint T367856
[12:25:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:25:47] <stashbot>	 T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[12:25:51] <logmsgbot>	 !log brouberol@cumin1002 START - Cookbook sre.druid.reboot-workers for Druid public cluster: Reboot Druid nodes
[12:25:52] <icinga-wm>	 RECOVERY - Host mw2307 is UP: PING OK - Packet loss = 0%, RTA = 30.32 ms
[12:28:00] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:28:26] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:28:44] <wikibugs>	 (CR) Vgutierrez: [C:+2] gerrit: Stop using SSLCertificateChainFile [puppet] - https://gerrit.wikimedia.org/r/1051304 (https://phabricator.wikimedia.org/T369014) (owner: Vgutierrez)
[12:30:20] <wikibugs>	 (CR) Ayounsi: [C:+2] Preseed: set /32 netmask for virtual ranges [puppet] - https://gerrit.wikimedia.org/r/1051342 (https://phabricator.wikimedia.org/T300152) (owner: Ayounsi)
[12:30:51] <icinga-wm>	 PROBLEM - Host mw2307 is DOWN: PING CRITICAL - Packet loss = 100%
[12:31:19] <icinga-wm>	 RECOVERY - Host mw2307 is UP: PING OK - Packet loss = 0%, RTA = 30.31 ms
[12:31:27] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:33:59] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2030.codfw.wmnet with reason: host reimage
[12:34:06] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2031.codfw.wmnet with reason: host reimage
[12:34:12] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2034.codfw.wmnet with reason: host reimage
[12:34:16] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2033.codfw.wmnet with reason: host reimage
[12:34:18] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2032.codfw.wmnet with reason: host reimage
[12:34:41] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: toolforge: prometheus: scrape kyverno metrics [puppet] - https://gerrit.wikimedia.org/r/1051351 (https://phabricator.wikimedia.org/T368515)
[12:35:32] <wikibugs>	 (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1051351 (https://phabricator.wikimedia.org/T368515) (owner: Arturo Borrero Gonzalez)
[12:36:25] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2030.codfw.wmnet with reason: host reimage
[12:36:54] <wikibugs>	 (Abandoned) RhinosF1: remove s10 references [software/conftool] - https://gerrit.wikimedia.org/r/708632 (https://phabricator.wikimedia.org/T167973) (owner: RhinosF1)
[12:37:22] <wikibugs>	 (Abandoned) RhinosF1: test [puppet] - https://gerrit.wikimedia.org/r/980470 (owner: RhinosF1)
[12:39:52] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2031.codfw.wmnet with reason: host reimage
[12:40:21] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C:+2] toolforge: prometheus: scrape kyverno metrics [puppet] - https://gerrit.wikimedia.org/r/1051351 (https://phabricator.wikimedia.org/T368515) (owner: Arturo Borrero Gonzalez)
[12:40:25] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 13 Aug 2024 12:55:14 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:40:51] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.182 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:41:25] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52338 bytes in 0.084 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:41:45] <wikibugs>	 ops-eqiad, DC-Ops, Prod-Kubernetes, serviceops: kubernetes1051.eqiad.wmnet failed to pull mediawiki images - https://phabricator.wikimedia.org/T369011#9944322 (JMeybohm)
[12:41:51] <logmsgbot>	 !log jayme@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host kubernetes1051.eqiad.wmnet
[12:41:53] <wikibugs>	 (PS1) Ayounsi: Routed ganeti: remove /23 -> /32 workaround [puppet] - https://gerrit.wikimedia.org/r/1051352
[12:42:00] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2034.codfw.wmnet with reason: host reimage
[12:43:12] <wikibugs>	 (CR) Ayounsi: [C:+2] Routed ganeti: remove /23 -> /32 workaround [puppet] - https://gerrit.wikimedia.org/r/1051352 (owner: Ayounsi)
[12:44:16] <jinxer-wm>	 FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:44:20] <effie>	 !log decom eqiad old kubemasters - T353464
[12:44:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:44:26] <stashbot>	 T353464: Migrate wikikube control planes to hardware nodes - https://phabricator.wikimedia.org/T353464
[12:45:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T367856)', diff saved to https://phabricator.wikimedia.org/P65661 and previous config saved to /var/cache/conftool/dbconfig/20240702-124517-marostegui.json
[12:45:25] <stashbot>	 T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[12:45:39] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2033.codfw.wmnet with reason: host reimage
[12:46:04] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on kubemaster[1001-1002].eqiad.wmnet with reason: decom
[12:46:18] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on kubemaster[1001-1002].eqiad.wmnet with reason: decom
[12:49:09] <wikibugs>	 (CR) Bking: dse-k8s-services: Add net-new chart for Airflow (8 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) (owner: Bking)
[12:49:32] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2032.codfw.wmnet with reason: host reimage
[12:49:50] <wikibugs>	 (CR) Kamila Součková: [C:+1] kubernetes: retire kubemaster100[1-2] in eqiad [puppet] - https://gerrit.wikimedia.org/r/1051323 (https://phabricator.wikimedia.org/T353464) (owner: Effie Mouzeli)
[12:49:56] <logmsgbot>	 !log jiji@cumin1002 conftool action : set/pooled=no; selector: name=kubemaster100[1-2].eqiad.wmnet
[12:50:53] <icinga-wm>	 PROBLEM - Host mw2307 is DOWN: PING CRITICAL - Packet loss = 100%
[12:53:23] <icinga-wm>	 RECOVERY - Host mw2307 is UP: PING OK - Packet loss = 0%, RTA = 30.38 ms
[12:53:33] <icinga-wm>	 PROBLEM - Host mw2309 is DOWN: PING CRITICAL - Packet loss = 100%
[12:54:14] <wikibugs>	 (CR) Urbanecm: [C:+1] GrowthExperiments: add community updates module flag [mediawiki-config] - https://gerrit.wikimedia.org/r/1047961 (https://phabricator.wikimedia.org/T365877) (owner: Sergio Gimeno)
[12:55:14] <logmsgbot>	 !log jiji@cumin1002 conftool action : set/pooled=inactive; selector: name=kubemaster100[1-2].eqiad.wmnet
[12:56:03] <icinga-wm>	 RECOVERY - Host mw2309 is UP: PING OK - Packet loss = 0%, RTA = 30.42 ms
[12:56:37] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2030.codfw.wmnet with OS bullseye
[12:57:00] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): Enable EntitySchema data type on Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/1042209 (https://phabricator.wikimedia.org/T332157) (owner: Lucas Werkmeister (WMDE))
[12:57:39] <icinga-wm>	 RECOVERY - Backup freshness on backup1001 is OK: Fresh: 146 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[12:59:22] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2031.codfw.wmnet with OS bullseye
[12:59:42] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2034.codfw.wmnet with OS bullseye
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: That opportune time for a UTC afternoon backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240702T1300).
[13:00:05] <jouncebot>	 Lucas_WMDE and James_F: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:07] <Lucas_WMDE>	 o/
[13:00:09] <Lucas_WMDE>	 I can deploy!
[13:00:25] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P65662 and previous config saved to /var/cache/conftool/dbconfig/20240702-130024-marostegui.json
[13:00:55] <wikibugs>	 (PS3) Lucas Werkmeister (WMDE): Enable EntitySchema data type on Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/1042209 (https://phabricator.wikimedia.org/T332157)
[13:01:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:01:34] <urbanecm>	 Lucas_WMDE: you beat me to it. FWIW, I added a last-minute addition.
[13:01:49] <urbanecm>	 nothing to test on that, feel free to ship it with something else if needed.
[13:02:33] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1042209 (https://phabricator.wikimedia.org/T332157) (owner: Lucas Werkmeister (WMDE))
[13:02:43] <Lucas_WMDE>	 urbanecm: alright, looking
[13:03:12] <wikibugs>	 (Merged) jenkins-bot: Enable EntitySchema data type on Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/1042209 (https://phabricator.wikimedia.org/T332157) (owner: Lucas Werkmeister (WMDE))
[13:03:18] <Lucas_WMDE>	 okay, looks fine, I’ll deploy that together with the wikifunctions change then
[13:03:22] <Lucas_WMDE>	 unless James_F wants to do that one
[13:03:34] <James_F>	 Nah, I'll stand back and let you sling them out together.
[13:03:38] <Lucas_WMDE>	 ok ^^
[13:03:41] <wikibugs>	 (CR) Elukey: [C:+2] Allow to save new OS names without them being present on the DB [software/debmonitor] - https://gerrit.wikimedia.org/r/1049966 (https://phabricator.wikimedia.org/T368744) (owner: Elukey)
[13:03:41] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Started scap sync-world: Backport for [[gerrit:1042209|Enable EntitySchema data type on Wikidata (T332157)]]
[13:03:44] <stashbot>	 T332157: [ES-M2]: Enable new EntitySchema data type on Wikidata - https://phabricator.wikimedia.org/T332157
[13:03:52] <wikibugs>	 (CR) Arnaudb: [C:+1] mariadb: enable pt-heartbeat monitoring through mysqld-exporter (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1048006 (https://phabricator.wikimedia.org/T367278) (owner: Arnaudb)
[13:04:35] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2033.codfw.wmnet with OS bullseye
[13:05:28] <wikibugs>	 (Merged) jenkins-bot: Allow to save new OS names without them being present on the DB [software/debmonitor] - https://gerrit.wikimedia.org/r/1049966 (https://phabricator.wikimedia.org/T368744) (owner: Elukey)
[13:06:18] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde: Backport for [[gerrit:1042209|Enable EntitySchema data type on Wikidata (T332157)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:07:31] <wikibugs>	 (PS2) Volans: Images: take advantage of performance optimization [software/debmonitor] - https://gerrit.wikimedia.org/r/1051298
[13:08:14] <logmsgbot>	 !log elukey@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'.
[13:08:25] <logmsgbot>	 !log elukey@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'.
[13:08:28] <wikibugs>	 (CR) TChin: EventStreamConfig: Add hive ingestion defaults (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1050596 (https://phabricator.wikimedia.org/T367134) (owner: TChin)
[13:09:00] <Lucas_WMDE>	 testing…
[13:09:16] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde: Continuing with sync
[13:09:21] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.decommission for hosts kubemaster[1001-1002].eqiad.wmnet
[13:09:40] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2032.codfw.wmnet with OS bullseye
[13:11:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:13:38] <wikibugs>	 (PS3) Volans: Images: automatically migrate to a new OS [software/debmonitor] - https://gerrit.wikimedia.org/r/1051299 (https://phabricator.wikimedia.org/T368744)
[13:14:36] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1042209|Enable EntitySchema data type on Wikidata (T332157)]] (duration: 10m 54s)
[13:14:38] <stashbot>	 T332157: [ES-M2]: Enable new EntitySchema data type on Wikidata - https://phabricator.wikimedia.org/T332157
[13:15:18] <wikibugs>	 (CR) Filippo Giunchedi: mariadb: enable pt-heartbeat monitoring through mysqld-exporter (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1048006 (https://phabricator.wikimedia.org/T367278) (owner: Arnaudb)
[13:15:32] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P65663 and previous config saved to /var/cache/conftool/dbconfig/20240702-131531-marostegui.json
[13:16:03] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1048059 (https://phabricator.wikimedia.org/T366610) (owner: Jforrester)
[13:16:03] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1047961 (https://phabricator.wikimedia.org/T365877) (owner: Sergio Gimeno)
[13:16:25] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:16:26] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.dns.netbox
[13:16:59] <wikibugs>	 (Merged) jenkins-bot: [wikifunctions] Grant wikifunctions-staff enum and converter rights [mediawiki-config] - https://gerrit.wikimedia.org/r/1048059 (https://phabricator.wikimedia.org/T366610) (owner: Jforrester)
[13:17:02] <wikibugs>	 (Merged) jenkins-bot: GrowthExperiments: add community updates module flag [mediawiki-config] - https://gerrit.wikimedia.org/r/1047961 (https://phabricator.wikimedia.org/T365877) (owner: Sergio Gimeno)
[13:17:31] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Started scap sync-world: Backport for [[gerrit:1048059|[wikifunctions] Grant wikifunctions-staff enum and converter rights (T366610 T367270)]], [[gerrit:1047961|GrowthExperiments: add community updates module flag (T365877)]]
[13:17:36] <stashbot>	 T366610: Restrict creation of instances of Types with identity keys to wikilambda-create-enum-value - https://phabricator.wikimedia.org/T366610
[13:17:37] <stashbot>	 T367270: Add rights for creation and edition of type converters (Z46 and Z46) - https://phabricator.wikimedia.org/T367270
[13:17:37] <stashbot>	 T365877: Community updates module: Title & Body text - https://phabricator.wikimedia.org/T365877
[13:18:52] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kubemaster[1001-1002].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1002"
[13:19:15] <wikibugs>	 (PS3) Volans: Images: take advantage of performance optimization [software/debmonitor] - https://gerrit.wikimedia.org/r/1051298
[13:19:15] <wikibugs>	 (PS4) Volans: Images: automatically migrate to a new OS [software/debmonitor] - https://gerrit.wikimedia.org/r/1051299 (https://phabricator.wikimedia.org/T368744)
[13:19:49] <wikibugs>	 (CR) Volans: "If we decide to go with this approach we can add the tests." [software/debmonitor] - https://gerrit.wikimedia.org/r/1051299 (https://phabricator.wikimedia.org/T368744) (owner: Volans)
[13:20:03] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 sgimeno, jforrester, lucaswerkmeister-wmde: Backport for [[gerrit:1048059|[wikifunctions] Grant wikifunctions-staff enum and converter rights (T366610 T367270)]], [[gerrit:1047961|GrowthExperiments: add community updates module flag (T365877)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:20:28] <wikibugs>	 (PS8) Elukey: Switch php7.4-cli to bullseye and cascade [docker-images/production-images] - https://gerrit.wikimedia.org/r/1021922 (https://phabricator.wikimedia.org/T356293) (owner: Jforrester)
[13:21:25] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kubemaster[1001-1002].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1002"
[13:21:25] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:21:25] <jinxer-wm>	 FIRING: [7x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:21:26] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts kubemaster[1001-1002].eqiad.wmnet
[13:21:52] <Lucas_WMDE>	 James_F: want to test the permission changes?
[13:21:56] <James_F>	 Lucas_WMDE: Sure.
[13:21:57] <Lucas_WMDE>	 https://www.wikifunctions.org/w/api.php?action=query&meta=siteinfo&siprop=usergroups|restrictions&format=json&formatversion=2 looks good to me, at least
[13:22:33] <wikibugs>	 (CR) Elukey: "James I took the liberty to rebase and modify again the versions, IIUC from Joe the -sX suffix was only for security releases/concerns, so" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1021922 (https://phabricator.wikimedia.org/T356293) (owner: Jforrester)
[13:22:38] <James_F>	 Lucas_WMDE: LGTM.
[13:22:47] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 sgimeno, jforrester, lucaswerkmeister-wmde: Continuing with sync
[13:22:51] <claime>	 !log homer 'cr*codfw*' commit 'T351074'
[13:22:52] <Lucas_WMDE>	 alright, thanks for testing!
[13:22:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:22:53] <stashbot>	 T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074
[13:23:23] <wikibugs>	 (PS1) Peter Fischer: Search update pipeline: reduce client-side rate-limit [deployment-charts] - https://gerrit.wikimedia.org/r/1051358 (https://phabricator.wikimedia.org/T362310)
[13:23:36] <wikibugs>	 (CR) Jforrester: "> James I took the liberty to rebase and modify again the versions, IIUC from Joe the -sX suffix was only for security releases/concerns, " [docker-images/production-images] - https://gerrit.wikimedia.org/r/1021922 (https://phabricator.wikimedia.org/T356293) (owner: Jforrester)
[13:26:24] <wikibugs>	 (PS2) Effie Mouzeli: kubernetes: retire kubemaster100[1-2] in eqiad [puppet] - https://gerrit.wikimedia.org/r/1051323 (https://phabricator.wikimedia.org/T353464)
[13:27:54] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1048059|[wikifunctions] Grant wikifunctions-staff enum and converter rights (T366610 T367270)]], [[gerrit:1047961|GrowthExperiments: add community updates module flag (T365877)]] (duration: 10m 22s)
[13:28:00] <stashbot>	 T366610: Restrict creation of instances of Types with identity keys to wikilambda-create-enum-value - https://phabricator.wikimedia.org/T366610
[13:28:00] <stashbot>	 T367270: Add rights for creation and edition of type converters (Z46 and Z46) - https://phabricator.wikimedia.org/T367270
[13:28:02] <stashbot>	 T365877: Community updates module: Title & Body text - https://phabricator.wikimedia.org/T365877
[13:29:11] <wikibugs>	 (CR) Effie Mouzeli: [C:+2] kubernetes: retire kubemaster100[1-2] in eqiad [puppet] - https://gerrit.wikimedia.org/r/1051323 (https://phabricator.wikimedia.org/T353464) (owner: Effie Mouzeli)
[13:29:24] <wikibugs>	 (PS1) Jgiannelos: Remove page html endpoints from changeprop [deployment-charts] - https://gerrit.wikimedia.org/r/1051361
[13:29:53] <Lucas_WMDE>	 James_F, urbanecm: should be deployed now
[13:29:58] <urbanecm>	 thanks
[13:30:08] <Lucas_WMDE>	 (well, whenever beta next runs a config update, I guess ^^)
[13:30:23] <wikibugs>	 (PS2) Jgiannelos: Remove page html endpoints from changeprop [deployment-charts] - https://gerrit.wikimedia.org/r/1051361 (https://phabricator.wikimedia.org/T367418)
[13:30:27] <wikibugs>	 SRE, serviceops, Data Products (Data Products Sprint 15), Patch-For-Review, Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9944546 (xcollazo) >>! In T361835#9943486, @mforns wrote: > ... >> The servi...
[13:30:29] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[13:30:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T367856)', diff saved to https://phabricator.wikimedia.org/P65664 and previous config saved to /var/cache/conftool/dbconfig/20240702-133038-marostegui.json
[13:30:41] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance
[13:30:43] <stashbot>	 T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[13:30:54] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance
[13:31:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1222 (T367856)', diff saved to https://phabricator.wikimedia.org/P65665 and previous config saved to /var/cache/conftool/dbconfig/20240702-133100-marostegui.json
[13:31:25] <jinxer-wm>	 FIRING: [16x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:33:32] <wikibugs>	 (CR) Herron: [C:+2] thanos: increase query frontend and store cache sizes [puppet] - https://gerrit.wikimedia.org/r/1051177 (https://phabricator.wikimedia.org/T368953) (owner: Herron)
[13:33:39] <wikibugs>	 (PS3) Jgiannelos: Remove page html endpoints from changeprop [deployment-charts] - https://gerrit.wikimedia.org/r/1051361 (https://phabricator.wikimedia.org/T367418)
[13:35:29] <claime>	 !log Pooling and uncordoning wikikube-worker2030.codfw.wmnet wikikube-worker2031.codfw.wmnet wikikube-worker2032.codfw.wmnet wikikube-worker2033.codfw.wmnet wikikube-worker2034.codfw.wmnet - T351074
[13:35:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:35:32] <stashbot>	 T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074
[13:35:51] <wikibugs>	 (PS4) Jgiannelos: Remove page html endpoints from changeprop [deployment-charts] - https://gerrit.wikimedia.org/r/1051361 (https://phabricator.wikimedia.org/T367418)
[13:35:59] <logmsgbot>	 !log cgoubert@cumin1002 conftool action : set/weight=10:pooled=yes; selector: name=(wikikube-worker2030.codfw.wmnet|wikikube-worker2031.codfw.wmnet|wikikube-worker2032.codfw.wmnet|wikikube-worker2033.codfw.wmnet|wikikube-worker2034.codfw.wmnet),cluster=kubernetes,service=kubesvc
[13:36:25] <jinxer-wm>	 FIRING: [15x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:39:16] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:39:29] <logmsgbot>	 !log brouberol@cumin1002 END (PASS) - Cookbook sre.druid.reboot-workers (exit_code=0) for Druid public cluster: Reboot Druid nodes
[13:39:32] <wikibugs>	 (PS1) Jforrester: Revert "wikifunctions: Upgrade orchestrator from 2024-06-11-223956 to 2024-06-17-221517" [deployment-charts] - https://gerrit.wikimedia.org/r/1051364 (https://phabricator.wikimedia.org/T368892)
[13:39:48] <wikibugs>	 (PS2) Jforrester: Revert "wikifunctions: Upgrade orchestrator from 2024-06-11-223956 to 2024-06-17-221517" [deployment-charts] - https://gerrit.wikimedia.org/r/1051364 (https://phabricator.wikimedia.org/T368892)
[13:39:58] <James_F>	 jouncebot: nowandnext
[13:39:59] <jouncebot>	 For the next 0 hour(s) and 20 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240702T1300)
[13:39:59] <jouncebot>	 In 1 hour(s) and 20 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240702T1500)
[13:40:08] <wikibugs>	 (CR) Jforrester: [C:+2] Revert "wikifunctions: Upgrade orchestrator from 2024-06-11-223956 to 2024-06-17-221517" [deployment-charts] - https://gerrit.wikimedia.org/r/1051364 (https://phabricator.wikimedia.org/T368892) (owner: Jforrester)
[13:41:03] <wikibugs>	 (Merged) jenkins-bot: Revert "wikifunctions: Upgrade orchestrator from 2024-06-11-223956 to 2024-06-17-221517" [deployment-charts] - https://gerrit.wikimedia.org/r/1051364 (https://phabricator.wikimedia.org/T368892) (owner: Jforrester)
[13:41:24] <logmsgbot>	 !log brouberol@cumin1002 START - Cookbook sre.druid.reboot-workers for Druid analytics cluster: Reboot Druid nodes
[13:41:25] <jinxer-wm>	 FIRING: [16x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:42:19] <logmsgbot>	 !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[13:42:46] <logmsgbot>	 !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[13:42:51] <wikibugs>	 (PS1) Effie Mouzeli: cumin: fix kube-master aliases [puppet] - https://gerrit.wikimedia.org/r/1051365 (https://phabricator.wikimedia.org/T353464)
[13:43:08] <logmsgbot>	 !log jforrester@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[13:43:12] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:44:02] <wikibugs>	 (CR) Aqu: [C:+1] "Thanks. Looks good." [mediawiki-config] - https://gerrit.wikimedia.org/r/1050596 (https://phabricator.wikimedia.org/T367134) (owner: TChin)
[13:44:19] <logmsgbot>	 !log jforrester@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[13:44:56] <logmsgbot>	 !log jforrester@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[13:45:27] <wikibugs>	 (CR) JHathaway: [C:+1] "lgtm" [software/bitu] - https://gerrit.wikimedia.org/r/1051293 (https://phabricator.wikimedia.org/T366525) (owner: Slyngshede)
[13:45:48] <wikibugs>	 (CR) Jgiannelos: "This is a bit tricky. The part were we remove the `exec` parts that send requests is straightforward. What I am not very confident is the " [deployment-charts] - https://gerrit.wikimedia.org/r/1051361 (https://phabricator.wikimedia.org/T367418) (owner: Jgiannelos)
[13:46:00] <wikibugs>	 (PS1) Ayounsi: DHCP: Add support for routed ganeti subnets [puppet] - https://gerrit.wikimedia.org/r/1051366 (https://phabricator.wikimedia.org/T300152)
[13:46:03] <logmsgbot>	 !log jforrester@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[13:46:14] <icinga-wm>	 PROBLEM - Host an-druid1001 is DOWN: PING CRITICAL - Packet loss = 100%
[13:46:16] <icinga-wm>	 RECOVERY - Host an-druid1001 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[13:46:19] <wikibugs>	 (CR) CI reject: [V:-1] cumin: fix kube-master aliases [puppet] - https://gerrit.wikimedia.org/r/1051365 (https://phabricator.wikimedia.org/T353464) (owner: Effie Mouzeli)
[13:46:25] <jinxer-wm>	 FIRING: [17x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:46:53] <wikibugs>	 (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1051366 (https://phabricator.wikimedia.org/T300152) (owner: Ayounsi)
[13:47:05] <wikibugs>	 (CR) JHathaway: [C:+1] profile::puppetmaster::updatenetboot: update help and install targets [puppet] - https://gerrit.wikimedia.org/r/1051317 (owner: Elukey)
[13:47:25] <wikibugs>	 (PS2) Effie Mouzeli: cumin: fix kube-master aliases [puppet] - https://gerrit.wikimedia.org/r/1051365 (https://phabricator.wikimedia.org/T353464)
[13:47:55] <wikibugs>	 (PS2) Ayounsi: DHCP: Add support for routed ganeti subnets [puppet] - https://gerrit.wikimedia.org/r/1051366 (https://phabricator.wikimedia.org/T300152)
[13:48:01] <wikibugs>	 (PS1) Bking: wdqs: detune blackbox checks [puppet] - https://gerrit.wikimedia.org/r/1051369 (https://phabricator.wikimedia.org/T366405)
[13:49:00] <wikibugs>	 (CR) Elukey: [C:+2] profile::puppetmaster::updatenetboot: update help and install targets [puppet] - https://gerrit.wikimedia.org/r/1051317 (owner: Elukey)
[13:49:58] <wikibugs>	 (PS1) Ssingh: varnish: make donate.m redirect permanent and add tests [puppet] - https://gerrit.wikimedia.org/r/1051370 (https://phabricator.wikimedia.org/T368645)
[13:50:35] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] wikifeeds: enable mesh tracing [deployment-charts] - https://gerrit.wikimedia.org/r/1043078 (https://phabricator.wikimedia.org/T320563) (owner: Filippo Giunchedi)
[13:50:40] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] shellboxen: enable mesh tracing [deployment-charts] - https://gerrit.wikimedia.org/r/1043085 (https://phabricator.wikimedia.org/T320563) (owner: Filippo Giunchedi)
[13:50:43] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] page-analytics: enable mesh tracing [deployment-charts] - https://gerrit.wikimedia.org/r/1043077 (https://phabricator.wikimedia.org/T320563) (owner: Filippo Giunchedi)
[13:51:25] <jinxer-wm>	 FIRING: [17x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:51:42] <wikibugs>	 (CR) Elukey: [C:+1] Cookbooks: fix Netbox 4 breaking changes [cookbooks] - https://gerrit.wikimedia.org/r/1050445 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi)
[13:51:44] <wikibugs>	 (PS3) Filippo Giunchedi: page-analytics: enable mesh tracing [deployment-charts] - https://gerrit.wikimedia.org/r/1043077 (https://phabricator.wikimedia.org/T320563)
[13:51:49] <wikibugs>	 (PS1) Jforrester: wikifunctions: Re-apply "Upgrade orchestrator from 2024-06-11-223956 to 2024-06-17-221517" [deployment-charts] - https://gerrit.wikimedia.org/r/1051373
[13:51:54] <godog>	 jouncebot: now and next
[13:51:54] <jouncebot>	 For the next 0 hour(s) and 8 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240702T1300)
[13:51:54] <wikibugs>	 (CR) Jforrester: [C:+2] wikifunctions: Re-apply "Upgrade orchestrator from 2024-06-11-223956 to 2024-06-17-221517" [deployment-charts] - https://gerrit.wikimedia.org/r/1051373 (owner: Jforrester)
[13:52:32] <wikibugs>	 (PS1) Clément Goubert: mw-misc: deploy statsd-exporter [deployment-charts] - https://gerrit.wikimedia.org/r/1051367 (https://phabricator.wikimedia.org/T365265)
[13:52:43] <wikibugs>	 (PS1) Clément Goubert: mw-wikifunctions: deploy statsd-exporter [deployment-charts] - https://gerrit.wikimedia.org/r/1051368 (https://phabricator.wikimedia.org/T365265)
[13:53:12] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:55:25] <wikibugs>	 (CR) Ssingh: [V:+1 C:+2] hiera: dns6001: reduce anycast_hc logging level and backups [puppet] - https://gerrit.wikimedia.org/r/1050626 (owner: Ssingh)
[13:56:05] <wikibugs>	 (PS3) Arnaudb: mysql: pt-heartbeat alerting rules [alerts] - https://gerrit.wikimedia.org/r/1047983 (https://phabricator.wikimedia.org/T367278)
[13:56:25] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:56:38] <wikibugs>	 (CR) Arnaudb: mysql: pt-heartbeat alerting rules (4 comments) [alerts] - https://gerrit.wikimedia.org/r/1047983 (https://phabricator.wikimedia.org/T367278) (owner: Arnaudb)
[13:56:46] <wikibugs>	 (CR) Elukey: [C:+1] "Very nice! TIL managers in django :)" [software/debmonitor] - https://gerrit.wikimedia.org/r/1051298 (owner: Volans)
[13:57:17] <wikibugs>	 (CR) CI reject: [V:-1] mysql: pt-heartbeat alerting rules [alerts] - https://gerrit.wikimedia.org/r/1047983 (https://phabricator.wikimedia.org/T367278) (owner: Arnaudb)
[13:58:46] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977#9944795 (cmooney)
[13:58:47] <effie>	 !log decom old eqiad and codfw kubetcd hosts
[13:58:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:59:42] <icinga-wm>	 PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 145 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[13:59:59] <wikibugs>	 SRE, SRE-swift-storage, Data-Persistence, DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e1-eqiad - https://phabricator.wikimedia.org/T365993#9944811 (cmooney)
[14:00:32] <wikibugs>	 (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1051348 (https://phabricator.wikimedia.org/T367756) (owner: Fabfur)
[14:01:25] <jinxer-wm>	 FIRING: [6x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:01:27] <wikibugs>	 (CR) CI reject: [V:-1] wikifeeds: enable mesh tracing [deployment-charts] - https://gerrit.wikimedia.org/r/1043078 (https://phabricator.wikimedia.org/T320563) (owner: Filippo Giunchedi)
[14:01:27] <wikibugs>	 (CR) CI reject: [V:-1] shellboxen: enable mesh tracing [deployment-charts] - https://gerrit.wikimedia.org/r/1043085 (https://phabricator.wikimedia.org/T320563) (owner: Filippo Giunchedi)
[14:01:51] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=dns6001.wikimedia.org
[14:01:56] <wikibugs>	 (CR) Filippo Giunchedi: [V:+2 C:+2] page-analytics: enable mesh tracing [deployment-charts] - https://gerrit.wikimedia.org/r/1043077 (https://phabricator.wikimedia.org/T320563) (owner: Filippo Giunchedi)
[14:02:02] <sukhe>	 !log restart anycast-hc on dns6001
[14:02:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:02:19] <wikibugs>	 (Merged) jenkins-bot: wikifunctions: Re-apply "Upgrade orchestrator from 2024-06-11-223956 to 2024-06-17-221517" [deployment-charts] - https://gerrit.wikimedia.org/r/1051373 (owner: Jforrester)
[14:02:35] <wikibugs>	 (CR) Elukey: "Looks really great, have you tried to dry-run it via test-cookbook to double check that everything looks good? (see https://wikitech.wikim" [cookbooks] - https://gerrit.wikimedia.org/r/1049950 (owner: Ssingh)
[14:03:04] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=dns6001.wikimedia.org
[14:03:07] <wikibugs>	 (PS4) Arnaudb: mysql: pt-heartbeat alerting rules [alerts] - https://gerrit.wikimedia.org/r/1047983 (https://phabricator.wikimedia.org/T367278)
[14:03:08] <logmsgbot>	 !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[14:03:48] <logmsgbot>	 !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[14:04:05] <logmsgbot>	 !log jforrester@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[14:04:08] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=dns6001.wikimedia.org,service=recdns
[14:04:55] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in search_eqiad
[14:04:57] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in search_eqiad
[14:04:59] <logmsgbot>	 !log filippo@deploy1002 helmfile [codfw] START helmfile.d/services/page-analytics: apply
[14:05:21] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1051369 (https://phabricator.wikimedia.org/T366405) (owner: Bking)
[14:05:25] <wikibugs>	 (PS1) David Caro: replica_cnf: add tools to url [puppet] - https://gerrit.wikimedia.org/r/1051375 (https://phabricator.wikimedia.org/T368909)
[14:05:30] <logmsgbot>	 !log filippo@deploy1002 helmfile [codfw] DONE helmfile.d/services/page-analytics: apply
[14:05:45] <logmsgbot>	 !log jforrester@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[14:05:53] <logmsgbot>	 !log jforrester@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[14:05:58] <wikibugs>	 (CR) Volans: [C:+2] Images: take advantage of performance optimization [software/debmonitor] - https://gerrit.wikimedia.org/r/1051298 (owner: Volans)
[14:05:59] <logmsgbot>	 !log filippo@deploy1002 helmfile [eqiad] START helmfile.d/services/page-analytics: apply
[14:06:18] <logmsgbot>	 !log filippo@deploy1002 helmfile [eqiad] DONE helmfile.d/services/page-analytics: apply
[14:06:20] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=dns6001.wikimedia.org,service=recdns
[14:06:25] <jinxer-wm>	 FIRING: [6x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:06:49] <wikibugs>	 (PS2) David Caro: replica_cnf: add tools to url [puppet] - https://gerrit.wikimedia.org/r/1051375 (https://phabricator.wikimedia.org/T368909)
[14:06:49] <wikibugs>	 (PS1) David Caro: ceph: update the cloudcephosd1008 iface names [puppet] - https://gerrit.wikimedia.org/r/1051376 (https://phabricator.wikimedia.org/T348643)
[14:07:01] <logmsgbot>	 !log jforrester@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[14:07:30] <wikibugs>	 (CR) David Caro: [C:+2] ceph: update the cloudcephosd1008 iface names [puppet] - https://gerrit.wikimedia.org/r/1051376 (https://phabricator.wikimedia.org/T348643) (owner: David Caro)
[14:07:34] <wikibugs>	 (PS1) Ssingh: Revert "hiera: dns6001: reduce anycast_hc logging level and backups" [puppet] - https://gerrit.wikimedia.org/r/1051377
[14:10:10] <wikibugs>	 (CR) Ssingh: [C:+2] Revert "hiera: dns6001: reduce anycast_hc logging level and backups" [puppet] - https://gerrit.wikimedia.org/r/1051377 (owner: Ssingh)
[14:10:17] <wikibugs>	 (CR) CI reject: [V:-1] replica_cnf: add tools to url [puppet] - https://gerrit.wikimedia.org/r/1051375 (https://phabricator.wikimedia.org/T368909) (owner: David Caro)
[14:11:07] <wikibugs>	 (PS2) Bking: wdqs: detune blackbox checks [puppet] - https://gerrit.wikimedia.org/r/1051369 (https://phabricator.wikimedia.org/T366405)
[14:11:11] <wikibugs>	 (Merged) jenkins-bot: Images: take advantage of performance optimization [software/debmonitor] - https://gerrit.wikimedia.org/r/1051298 (owner: Volans)
[14:11:14] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1051369 (https://phabricator.wikimedia.org/T366405) (owner: Bking)
[14:11:27] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 6 hosts with reason: decom
[14:11:28] <logmsgbot>	 !log jiji@cumin1002 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 2 days, 0:00:00 on 6 hosts with reason: decom
[14:12:12] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 6 hosts with reason: decom
[14:12:19] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 6 hosts with reason: decom
[14:12:46] <wikibugs>	 (PS1) Elukey: docker::reporter: update exclude rules [puppet] - https://gerrit.wikimedia.org/r/1051379 (https://phabricator.wikimedia.org/T367427)
[14:13:39] <wikibugs>	 (CR) Ssingh: "Thanks! I did test-cookbook -c 1049950 --dry-run sre.dns.roll-restart-ntp --reason 'testing dry run' --alias dnsbox restart_daemons. Outpu" [cookbooks] - https://gerrit.wikimedia.org/r/1049950 (owner: Ssingh)
[14:13:50] <wikibugs>	 (PS2) Elukey: docker::reporter: update exclude rules [puppet] - https://gerrit.wikimedia.org/r/1051379 (https://phabricator.wikimedia.org/T367427)
[14:15:12] <wikibugs>	 (CR) Elukey: [C:+1] cookbooks/sre/dns: add a cookbook for roll restart of ntpd.service [cookbooks] - https://gerrit.wikimedia.org/r/1049950 (owner: Ssingh)
[14:15:49] <logmsgbot>	 !log root@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1008.eqiad.wmnet with OS bullseye
[14:15:56] <wikibugs>	 SRE, SRE-swift-storage, Data-Persistence, DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad - https://phabricator.wikimedia.org/T365994#9944984 (cmooney)
[14:16:10] <wikibugs>	 (PS3) David Caro: replica_cnf: add tools to url [puppet] - https://gerrit.wikimedia.org/r/1051375 (https://phabricator.wikimedia.org/T368909)
[14:16:58] <wikibugs>	 (PS1) Effie Mouzeli: Remove kubetcd100 from etcd SRV records [dns] - https://gerrit.wikimedia.org/r/1051380 (https://phabricator.wikimedia.org/T353464)
[14:19:31] <logmsgbot>	 !log dcaro@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1008.eqiad.wmnet
[14:19:49] <wikibugs>	 (PS1) Ssingh: dnsbox: set anycast-hc num_backups to one [puppet] - https://gerrit.wikimedia.org/r/1051381
[14:20:11] <wikibugs>	 (CR) Ssingh: "Thanks for the review!" [cookbooks] - https://gerrit.wikimedia.org/r/1049950 (owner: Ssingh)
[14:21:04] <wikibugs>	 (PS2) Ssingh: dnsbox: set anycast-hc num_backups to one [puppet] - https://gerrit.wikimedia.org/r/1051381
[14:21:39] <wikibugs>	 (CR) Ssingh: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3138/co" [puppet] - https://gerrit.wikimedia.org/r/1051381 (owner: Ssingh)
[14:22:28] <wikibugs>	 (PS6) Arturo Borrero Gonzalez: openstack: nova-compute: remove support for legacy NIC setup [puppet] - https://gerrit.wikimedia.org/r/1049200 (https://phabricator.wikimedia.org/T319184)
[14:22:38] <wikibugs>	 (CR) Arnaudb: [C:+1] mariadb: enable pt-heartbeat monitoring through mysqld-exporter (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1048006 (https://phabricator.wikimedia.org/T367278) (owner: Arnaudb)
[14:23:12] <wikibugs>	 (PS2) Effie Mouzeli: Remove kubetcd* from etcd SRV records (eqiad+codfw) [dns] - https://gerrit.wikimedia.org/r/1051380 (https://phabricator.wikimedia.org/T353464)
[14:23:50] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering: Grant Access to analytics-privatedata-users for cwylo - https://phabricator.wikimedia.org/T368027#9945028 (Volans)
[14:25:13] <wikibugs>	 (CR) David Caro: "Tested in toolsbeta:" [puppet] - https://gerrit.wikimedia.org/r/1051375 (https://phabricator.wikimedia.org/T368909) (owner: David Caro)
[14:25:17] <wikibugs>	 (CR) Ssingh: [C:+2] cookbooks/sre/dns: add a cookbook for roll restart of ntpd.service [cookbooks] - https://gerrit.wikimedia.org/r/1049950 (owner: Ssingh)
[14:26:06] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C:+2] openstack: nova-compute: remove support for legacy NIC setup [puppet] - https://gerrit.wikimedia.org/r/1049200 (https://phabricator.wikimedia.org/T319184) (owner: Arturo Borrero Gonzalez)
[14:26:21] <wikibugs>	 (CR) Kamila Součková: [C:+1] Remove kubetcd* from etcd SRV records (eqiad+codfw) [dns] - https://gerrit.wikimedia.org/r/1051380 (https://phabricator.wikimedia.org/T353464) (owner: Effie Mouzeli)
[14:26:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:28:26] <logmsgbot>	 !log dcaro@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1008.eqiad.wmnet
[14:29:28] <wikibugs>	 (CR) David Caro: [C:+2] "Passing in tools too:" [puppet] - https://gerrit.wikimedia.org/r/1051375 (https://phabricator.wikimedia.org/T368909) (owner: David Caro)
[14:30:04] <wikibugs>	 SRE, cloud-services-team, Infrastructure-Foundations, netops, Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9945048 (aborrero) Open→Stalled marking as stalled, because the work on ceph nodes wont be progressing for a while.
[14:32:53] <wikibugs>	 ops-eqiad, SRE, Cloud-VPS, DC-Ops, cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9945091 (CDanis) >>! In T348643#9931318, @dcaro wrote: > Any ideas/recommendations on how to proceed next? >  > I...
[14:34:04] <wikibugs>	 (CR) Effie Mouzeli: [C:+2] Remove kubetcd* from etcd SRV records (eqiad+codfw) [dns] - https://gerrit.wikimedia.org/r/1051380 (https://phabricator.wikimedia.org/T353464) (owner: Effie Mouzeli)
[14:35:09] <wikibugs>	 (PS3) Clément Goubert: P:kubernetes:node: Autorestart ferm.service [puppet] - https://gerrit.wikimedia.org/r/1051378 (https://phabricator.wikimedia.org/T354855)
[14:35:49] <wikibugs>	 (PS1) Volans: admin: add cwylo to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1051385 (https://phabricator.wikimedia.org/T368027)
[14:36:44] <wikibugs>	 (CR) CI reject: [V:-1] admin: add cwylo to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1051385 (https://phabricator.wikimedia.org/T368027) (owner: Volans)
[14:37:02] <wikibugs>	 (PS1) Elukey: knative: upgrade all images to Bullseye and Golang 1.19 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1051387 (https://phabricator.wikimedia.org/T368359)
[14:37:21] <wikibugs>	 (CR) Elukey: "== Step 0: scanning /home/elukey/Wikimedia/production-images/images/ ==" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1051387 (https://phabricator.wikimedia.org/T368359) (owner: Elukey)
[14:37:41] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.decommission for hosts kubetcd[1004-1006].eqiad.wmnet
[14:38:08] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.hosts.reimage for host testvm2007.codfw.wmnet with OS bookworm
[14:39:13] <wikibugs>	 (03PS2) 10Volans: admin: add cwylo to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1051385 (https://phabricator.wikimedia.org/T368027)
[14:39:16] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:41:08] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+1] mw-wikifunctions: deploy statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051368 (https://phabricator.wikimedia.org/T365265) (owner: 10Clément Goubert)
[14:41:30] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+1] mw-misc: deploy statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051367 (https://phabricator.wikimedia.org/T365265) (owner: 10Clément Goubert)
[14:43:49] <claime>	 jouncebot: nowandnext
[14:43:49] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 16 minute(s)
[14:43:49] <jouncebot>	 In 0 hour(s) and 16 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240702T1500)
[14:43:55] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] mw-misc: deploy statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051367 (https://phabricator.wikimedia.org/T365265) (owner: 10Clément Goubert)
[14:44:05] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] mw-wikifunctions: deploy statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051368 (https://phabricator.wikimedia.org/T365265) (owner: 10Clément Goubert)
[14:44:12] <wikibugs>	 (03PS2) 10Arnaudb: mariadb: installs backport mysqld-exporter on deb11 [puppet] - 10https://gerrit.wikimedia.org/r/1051388 (https://phabricator.wikimedia.org/T367278)
[14:45:02] <wikibugs>	 (03Merged) 10jenkins-bot: mw-misc: deploy statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051367 (https://phabricator.wikimedia.org/T365265) (owner: 10Clément Goubert)
[14:45:04] <wikibugs>	 (03Merged) 10jenkins-bot: mw-wikifunctions: deploy statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051368 (https://phabricator.wikimedia.org/T365265) (owner: 10Clément Goubert)
[14:45:09] <wikibugs>	 (03CR) 10Arnaudb: [C:03+1] mariadb: enable pt-heartbeat monitoring through mysqld-exporter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1048006 (https://phabricator.wikimedia.org/T367278) (owner: 10Arnaudb)
[14:45:33] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.dns.netbox
[14:46:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:47:50] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.decommission for hosts kubetcd[2004-2006].codfw.wmnet
[14:48:03] <wikibugs>	 (03CR) 10Fabfur: [C:03+2] hiera: upgrade haproxy to 2.8 on esams [puppet] - 10https://gerrit.wikimedia.org/r/1051348 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur)
[14:48:12] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kubetcd[1004-1006].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1002"
[14:49:33] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] admin: add cwylo to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1051385 (https://phabricator.wikimedia.org/T368027) (owner: 10Volans)
[14:49:48] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] admin: add cwylo to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1051385 (https://phabricator.wikimedia.org/T368027) (owner: 10Volans)
[14:50:25] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-wikifunctions: apply
[14:51:14] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kubetcd[1004-1006].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1002"
[14:51:14] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:51:15] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts kubetcd[1004-1006].eqiad.wmnet
[14:51:48] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Get test host connected to codfw row c/d lsw's - https://phabricator.wikimedia.org/T367512#9945244 (10Jhancock.wm) @cmooney got sretest2002 on lsw-d4, ports 44 and 45. 10G card.
[14:52:09] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-wikifunctions: apply
[14:52:20] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on wikikube-ctrl1001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[14:52:35] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-wikifunctions: apply
[14:52:43] <logmsgbot>	 !log brouberol@cumin1002 END (PASS) - Cookbook sre.druid.reboot-workers (exit_code=0) for Druid analytics cluster: Reboot Druid nodes
[14:52:58] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-wikifunctions: apply
[14:53:26] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-misc: apply
[14:53:48] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-misc: apply
[14:53:54] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-misc: apply
[14:54:31] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167#9945268 (10Papaul) @elukey we will work on this more tomorrow during the meeting. Thanks
[14:54:33] <wikibugs>	 (03CR) 10Volans: [C:03+2] admin: add cwylo to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1051385 (https://phabricator.wikimedia.org/T368027) (owner: 10Volans)
[14:55:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T364069)', diff saved to https://phabricator.wikimedia.org/P65666 and previous config saved to /var/cache/conftool/dbconfig/20240702-145542-marostegui.json
[14:55:45] <fabfur>	 !log upgrading A:cp-esams to haproxy 2.8.10 (T367756)
[14:55:46] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[14:55:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:55:48] <stashbot>	 T367756: Upgrade hosts to haproxy 2.8.10 - https://phabricator.wikimedia.org/T367756
[14:55:55] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_esams
[14:55:57] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_esams
[14:56:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:56:54] <wikibugs>	 (03CR) 10Dzahn: "Wanna share what the actual error is? We had similar cases that turned out to be legit things that can be fixed." [puppet] - 10https://gerrit.wikimedia.org/r/1051369 (https://phabricator.wikimedia.org/T366405) (owner: 10Bking)
[14:57:54] <wikibugs>	 (03CR) 10Filippo Giunchedi: "LGTM, though see inline for some considerations" [puppet] - 10https://gerrit.wikimedia.org/r/1051388 (https://phabricator.wikimedia.org/T367278) (owner: 10Arnaudb)
[14:58:05] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on testvm2007.codfw.wmnet with reason: host reimage
[14:58:14] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.dns.netbox
[14:58:28] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Get test host connected to codfw row c/d lsw's - https://phabricator.wikimedia.org/T367512#9945280 (10cmooney) >>! In T367512#9945244, @Jhancock.wm wrote: > @cmooney got sretest2002 on lsw-d4, ports 44 and 45. 10G card.   Awesome thank...
[14:58:36] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-misc: apply
[14:59:16] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:59:28] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Grant Access to analytics-privatedata-users for cwylo - https://phabricator.wikimedia.org/T368027#9945273 (10Volans) 05In progress→03Resolved @cwylo this is now done, I'm resolving the task. Within 30 minutes the change should be...
[14:59:42] <icinga-wm>	 RECOVERY - Backup freshness on backup1001 is OK: Fresh: 146 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[15:00:05] <jouncebot>	 eoghan, jelto, arnoldokoth, and mutante: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for SRE Collaboration Services office hours deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240702T1500).
[15:00:47] <wikibugs>	 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9945281 (10dcaro) >>! In T348643#9945091, @CDanis wrote: >>>! In T348643#9931318, @dcaro wrote: >> Any ideas/recomme...
[15:00:48] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Cabling for FR - https://phabricator.wikimedia.org/T368940#9945284 (10Papaul)
[15:02:07] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kubetcd[2004-2006].codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1002"
[15:02:38] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Cabling for FR - https://phabricator.wikimedia.org/T368940#9945286 (10Papaul)
[15:03:04] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on testvm2007.codfw.wmnet with reason: host reimage
[15:05:07] <wikibugs>	 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9945304 (10CDanis) Yeah okay, that's all pretty messy to potentially clean up from.  Have you tried the `ceph-syn` t...
[15:05:54] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kubetcd[2004-2006].codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1002"
[15:05:54] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:05:55] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts kubetcd[2004-2006].codfw.wmnet
[15:06:25] <jinxer-wm>	 RESOLVED: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:08:14] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on wikikube-ctrl2001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[15:09:16] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] P:kubernetes:node: Autorestart ferm.service [puppet] - 10https://gerrit.wikimedia.org/r/1051378 (https://phabricator.wikimedia.org/T354855) (owner: 10Clément Goubert)
[15:09:31] <wikibugs>	 (03CR) 10Elukey: [C:03+2] services: update thumbor-plugin Docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051150 (owner: 10Elukey)
[15:10:11] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] P:kubernetes:node: Autorestart ferm.service [puppet] - 10https://gerrit.wikimedia.org/r/1051378 (https://phabricator.wikimedia.org/T354855) (owner: 10Clément Goubert)
[15:10:52] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P65667 and previous config saved to /var/cache/conftool/dbconfig/20240702-151050-marostegui.json
[15:11:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:12:12] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: deploy1003: Comment them out from scap_masters [puppet] - 10https://gerrit.wikimedia.org/r/1051392 (https://phabricator.wikimedia.org/T364417)
[15:12:22] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Add per-output queue monitoring for Juniper network devices - https://phabricator.wikimedia.org/T326322#9945320 (10cmooney) All seems ok following the increase:  {F56173453 width=500}  FWIW the scraping is now taking longer, indicating that...
[15:12:47] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: kubeadm: core: remove deprecated pre-defined kubelet flags [puppet] - 10https://gerrit.wikimedia.org/r/1051393 (https://phabricator.wikimedia.org/T355881)
[15:12:49] <logmsgbot>	 !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync
[15:12:54] <logmsgbot>	 !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync
[15:13:09] <wikibugs>	 (03CR) 10CI reject: [V:04-1] kubeadm: core: remove deprecated pre-defined kubelet flags [puppet] - 10https://gerrit.wikimedia.org/r/1051393 (https://phabricator.wikimedia.org/T355881) (owner: 10Arturo Borrero Gonzalez)
[15:13:26] <wikibugs>	 (03CR) 10Ahmon Dancy: [C:03+1] deploy1003: Comment them out from scap_masters [puppet] - 10https://gerrit.wikimedia.org/r/1051392 (https://phabricator.wikimedia.org/T364417) (owner: 10Alexandros Kosiaris)
[15:13:48] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] mw-web: Add traindev environment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047158 (owner: 10Ahmon Dancy)
[15:14:42] <wikibugs>	 (03Merged) 10jenkins-bot: mw-web: Add traindev environment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047158 (owner: 10Ahmon Dancy)
[15:15:25] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: kubeadm: core: remove deprecated pre-defined kubelet flags [puppet] - 10https://gerrit.wikimedia.org/r/1051393 (https://phabricator.wikimedia.org/T355881)
[15:16:16] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+2] deploy1003: Comment them out from scap_masters [puppet] - 10https://gerrit.wikimedia.org/r/1051392 (https://phabricator.wikimedia.org/T364417) (owner: 10Alexandros Kosiaris)
[15:16:18] <wikibugs>	 (03PS3) 10Elukey: docker::reporter: update exclude rules [puppet] - 10https://gerrit.wikimedia.org/r/1051379 (https://phabricator.wikimedia.org/T367427)
[15:16:25] <jinxer-wm>	 RESOLVED: [2x] SystemdUnitFailed: ferm.service on wikikube-ctrl1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:16:54] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1051393 (https://phabricator.wikimedia.org/T355881) (owner: 10Arturo Borrero Gonzalez)
[15:17:44] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host testvm2007.codfw.wmnet with OS bookworm
[15:22:20] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on wikikube-ctrl1001 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[15:23:20] <wikibugs>	 (03PS1) 10Fabfur: hiera: consolidate haproxy version to 2.8 [puppet] - 10https://gerrit.wikimedia.org/r/1051394 (https://phabricator.wikimedia.org/T367756)
[15:24:08] <wikibugs>	 06SRE, 10SRE-Access-Requests: Grant Access to analytics-privatedata-users for <Sharvaniharan> - https://phabricator.wikimedia.org/T368566#9945385 (10Sharvaniharan) Thank you @Ottomata and @Dzahn  Should I be doing anything to get the analytics-privatedata-users access, or is this task sufficient?
[15:24:28] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1051393 (https://phabricator.wikimedia.org/T355881) (owner: 10Arturo Borrero Gonzalez)
[15:24:46] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1051394 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur)
[15:25:56] <wikibugs>	 (03PS20) 10Gergő Tisza: Handle sso.wikimedia.org domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162)
[15:25:58] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P65668 and previous config saved to /var/cache/conftool/dbconfig/20240702-152558-marostegui.json
[15:26:10] <wikibugs>	 (03CR) 10Gergő Tisza: Handle sso.wikimedia.org domain (034 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) (owner: 10Gergő Tisza)
[15:30:25] <wikibugs>	 (03PS1) 10Fabfur: hiera: consolidate haproxy version to 2.8 [puppet] - 10https://gerrit.wikimedia.org/r/1051395 (https://phabricator.wikimedia.org/T367756)
[15:30:48] <wikibugs>	 (03CR) 10CI reject: [V:04-1] hiera: consolidate haproxy version to 2.8 [puppet] - 10https://gerrit.wikimedia.org/r/1051395 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur)
[15:32:06] <wikibugs>	 (03CR) 10Vgutierrez: [C:04-1] "current approach requires LVS to be rebooted to be applied, some exec stanzas would be needed to enforce the change on run-puppet-agent ti" [puppet] - 10https://gerrit.wikimedia.org/r/1006063 (https://phabricator.wikimedia.org/T358260) (owner: 10Cathal Mooney)
[15:33:18] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] hiera: consolidate haproxy version to 2.8 [puppet] - 10https://gerrit.wikimedia.org/r/1051394 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur)
[15:33:48] <wikibugs>	 (03PS1) 10Jdlrobson: Make Flow work in dark mode by disabling backgrounds and setting text [extensions/Flow] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1051396 (https://phabricator.wikimedia.org/T357600)
[15:35:02] <wikibugs>	 (03PS7) 10Jdlrobson: [July 8th] Reduce list of exclusions for dark mode (1.43.0-wmf.12) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050671 (https://phabricator.wikimedia.org/T357575)
[15:35:11] <wikibugs>	 (03PS8) 10Jdlrobson: [July 8th] Reduce list of exclusions for dark mode (1.43.0-wmf.12) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050671 (https://phabricator.wikimedia.org/T357575)
[15:35:34] <wikibugs>	 (03PS9) 10Jdlrobson: [July 8th] Reduce list of exclusions for dark mode (1.43.0-wmf.12) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050671 (https://phabricator.wikimedia.org/T366368)
[15:35:58] <wikibugs>	 (03PS1) 10Fabfur: hiera: consolidate haproxy version to 2.8 [puppet] - 10https://gerrit.wikimedia.org/r/1051397 (https://phabricator.wikimedia.org/T367756)
[15:36:02] <wikibugs>	 (03PS10) 10Jdlrobson: [July 8th] Reduce list of exclusions for dark mode (1.43.0-wmf.12) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050671 (https://phabricator.wikimedia.org/T366368)
[15:36:22] <wikibugs>	 (03CR) 10CI reject: [V:04-1] hiera: consolidate haproxy version to 2.8 [puppet] - 10https://gerrit.wikimedia.org/r/1051397 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur)
[15:38:14] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on wikikube-ctrl2001 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[15:38:17] <wikibugs>	 (03Abandoned) 10Fabfur: hiera: consolidate haproxy version to 2.8 [puppet] - 10https://gerrit.wikimedia.org/r/1051395 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur)
[15:38:23] <wikibugs>	 (03PS1) 10Brouberol: Superset: upgrade Superset to version 4.0.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051398 (https://phabricator.wikimedia.org/T366060)
[15:38:34] <wikibugs>	 (03CR) 10Fabfur: [C:03+2] hiera: consolidate haproxy version to 2.8 [puppet] - 10https://gerrit.wikimedia.org/r/1051394 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur)
[15:39:55] <wikibugs>	 06SRE, 10Thumbor: wikipedia-pl-sysop: local images fail to generate thumbnail - https://phabricator.wikimedia.org/T368945#9945465 (10Volans)
[15:40:04] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: kubeadm: core: remove deprecated pre-defined kubelet flags [puppet] - 10https://gerrit.wikimedia.org/r/1051393 (https://phabricator.wikimedia.org/T355881)
[15:41:05] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T364069)', diff saved to https://phabricator.wikimedia.org/P65669 and previous config saved to /var/cache/conftool/dbconfig/20240702-154105-marostegui.json
[15:41:07] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2176.codfw.wmnet with reason: Maintenance
[15:41:13] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[15:41:20] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2176.codfw.wmnet with reason: Maintenance
[15:41:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2176 (T364069)', diff saved to https://phabricator.wikimedia.org/P65670 and previous config saved to /var/cache/conftool/dbconfig/20240702-154127-marostegui.json
[15:42:33] <jinxer-wm>	 FIRING: KubernetesCalicoDown: kubernetes1051.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=kubernetes1051.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[15:43:18] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.hosts.reimage for host testvm2007.codfw.wmnet with OS bookworm
[15:44:43] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 20:00:00 on kubernetes1051.eqiad.wmnet with reason: Hardware issue
[15:44:55] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, 06serviceops: kubernetes1051.eqiad.wmnet failed to pull mediawiki images - https://phabricator.wikimedia.org/T369011#9945483 (10Clement_Goubert) Host is flapping, setting downtime until tomorrow
[15:44:57] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on kubernetes1051.eqiad.wmnet with reason: Hardware issue
[15:45:02] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, 06serviceops: kubernetes1051.eqiad.wmnet failed to pull mediawiki images - https://phabricator.wikimedia.org/T369011#9945484 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=1d5196ee-59a9-4e12-b2fc-c8c25de6ab16) set by cgoubert@cumin1002...
[15:45:31] <wikibugs>	 (03PS1) 10Elukey: wmfdebug: Upgrade to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1051402 (https://phabricator.wikimedia.org/T368366)
[15:45:43] <wikibugs>	 (03CR) 10Elukey: "== Step 0: scanning /home/elukey/Wikimedia/production-images/images/ ==" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1051402 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey)
[15:46:06] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_esams
[15:47:29] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Nice. Remember that we still have to do a manual `superset db migrate` and a `superset init` once the new version is deployed." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051398 (https://phabricator.wikimedia.org/T366060) (owner: 10Brouberol)
[15:48:17] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] Superset: upgrade Superset to version 4.0.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051398 (https://phabricator.wikimedia.org/T366060) (owner: 10Brouberol)
[15:49:02] <wikibugs>	 (03CR) 10DCausse: [C:03+1] Search update pipeline: reduce client-side rate-limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051358 (https://phabricator.wikimedia.org/T362310) (owner: 10Peter Fischer)
[15:49:12] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_esams
[15:50:00] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset: apply
[15:50:28] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset: apply
[15:51:02] <wikibugs>	 06SRE, 10SRE-Access-Requests: Grant Access to analytics-privatedata-users for <Sharvaniharan> - https://phabricator.wikimedia.org/T368566#9945499 (10Volans) @Sharvaniharan I'll re-purpose this task for the revised requirement, I'll let you know if any data is missing
[15:51:32] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-master1004.eqiad.wmnet
[15:52:18] <wikibugs>	 06SRE, 06serviceops, 10Data Products (Data Products Sprint 15), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9945514 (10Scott_French) Thanks for the sample data, @xcollazo.  Using the fir...
[15:55:11] <wikibugs>	 (03CR) 10Vgutierrez: [C:04-1] "tests aren't happy here:" [puppet] - 10https://gerrit.wikimedia.org/r/1051370 (https://phabricator.wikimedia.org/T368645) (owner: 10Ssingh)
[15:55:56] <wikibugs>	 06SRE, 10SRE-Access-Requests: Grant Access to analytics-privatedata-users for Sharvaniharan - https://phabricator.wikimedia.org/T368566#9945529 (10Volans) p:05High→03Medium
[15:56:35] <wikibugs>	 (03CR) 10Vgutierrez: [C:04-1] varnish: make donate.m redirect permanent and add tests (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1051370 (https://phabricator.wikimedia.org/T368645) (owner: 10Ssingh)
[15:57:57] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on testvm2007.codfw.wmnet with reason: host reimage
[15:58:05] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-master1004.eqiad.wmnet
[16:00:04] <jouncebot>	 jhathaway and rzl: It is that lovely time of the day again! You are hereby commanded to deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240702T1600).
[16:00:04] <jouncebot>	 tgr: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[16:00:11] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Revert "Resurrect fluent-bit image" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1051407 (https://phabricator.wikimedia.org/T251812)
[16:00:15] <tgr|away>	 o/
[16:00:16] <rzl>	 tgr|away: hi I was just looking at this :)
[16:00:27] <rzl>	 it is more complex than I am comfortable deploying in the puppet window, I think
[16:00:43] <rzl>	 but let me see if I can find a domain expert who's willing to shepherd it through for you
[16:01:19] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-text_drmrs
[16:01:58] <tgr|away>	 thanks rzl 
[16:02:44] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on testvm2007.codfw.wmnet with reason: host reimage
[16:03:01] <wikibugs>	 06SRE, 10SRE-Access-Requests: Grant Access to analytics-privatedata-users for Sharvaniharan - https://phabricator.wikimedia.org/T368566#9945568 (10Volans) @Sharvaniharan please confirm to have read [[ https://wikitech.wikimedia.org/wiki/Analytics/Data_access#User_responsibilities | Analytics Data Access User R...
[16:03:24] <wikibugs>	 (03PS2) 10Ssingh: varnish: make donate.m redirect permanent and add tests [puppet] - 10https://gerrit.wikimedia.org/r/1051370 (https://phabricator.wikimedia.org/T368645)
[16:03:30] <wikibugs>	 (03CR) 10Ssingh: "Thanks, updated!" [puppet] - 10https://gerrit.wikimedia.org/r/1051370 (https://phabricator.wikimedia.org/T368645) (owner: 10Ssingh)
[16:04:42] <wikibugs>	 (03PS5) 10Ayounsi: DHCP: send subnet-mask 255.255.255.255 for routed ganeti VMs [puppet] - 10https://gerrit.wikimedia.org/r/1051366 (https://phabricator.wikimedia.org/T300152)
[16:04:42] <wikibugs>	 (03CR) 10Ayounsi: [V:03+1] "PCC is happy and the change has been tested with vmtest2007" [puppet] - 10https://gerrit.wikimedia.org/r/1051366 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi)
[16:05:50] <wikibugs>	 (03CR) 10Vgutierrez: [C:04-1] varnish: make donate.m redirect permanent and add tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1051370 (https://phabricator.wikimedia.org/T368645) (owner: 10Ssingh)
[16:05:51] <wikibugs>	 (03PS7) 10Jdlrobson: [July 2nd] Mobile: Enable dark mode for all users for tier 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050085 (https://phabricator.wikimedia.org/T367151)
[16:06:40] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: Revert "Resurrect fluent-bit image" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1051407 (https://phabricator.wikimedia.org/T251812)
[16:07:30] <wikibugs>	 (03PS1) 10CDanis: CHANGELOG for configuration 1.8.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051412 (https://phabricator.wikimedia.org/T362310)
[16:09:25] <wikibugs>	 (03PS1) 10Volans: admin: add sharvaniharan to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1051413 (https://phabricator.wikimedia.org/T368566)
[16:09:45] <wikibugs>	 (03CR) 10Volans: [C:04-1] "Pending approval on task" [puppet] - 10https://gerrit.wikimedia.org/r/1051413 (https://phabricator.wikimedia.org/T368566) (owner: 10Volans)
[16:13:29] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-upload_drmrs
[16:13:57] <wikibugs>	 (03PS1) 10Kamila Součková: benthos/mw_accesslog_metrics: Add buffer [puppet] - 10https://gerrit.wikimedia.org/r/1051415 (https://phabricator.wikimedia.org/T367076)
[16:13:58] <wikibugs>	 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9945668 (10dcaro) >>! In T348643#9945304, @CDanis wrote: > Yeah okay, that's all pretty messy to potentially clean u...
[16:15:50] <wikibugs>	 (03CR) 10Vgutierrez: [C:04-1] varnish: make donate.m redirect permanent and add tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1051370 (https://phabricator.wikimedia.org/T368645) (owner: 10Ssingh)
[16:16:14] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host testvm2007.codfw.wmnet with OS bookworm
[16:17:39] <wikibugs>	 (03PS3) 10Ssingh: varnish: make donate.m redirect permanent and add tests [puppet] - 10https://gerrit.wikimedia.org/r/1051370 (https://phabricator.wikimedia.org/T368645)
[16:17:50] <wikibugs>	 (03CR) 10Ssingh: varnish: make donate.m redirect permanent and add tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1051370 (https://phabricator.wikimedia.org/T368645) (owner: 10Ssingh)
[16:18:29] <wikibugs>	 (03PS1) 10Dzahn: gerrit: remove NRPE process monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1032526
[16:18:29] <wikibugs>	 (03CR) 10Dzahn: "@hashar - I'll just abandon this and keep it but I am still interested in the answer to that previous question. Do you have _actual_ mail " [puppet] - 10https://gerrit.wikimedia.org/r/1032526 (owner: 10Dzahn)
[16:20:15] <wikibugs>	 (03CR) 10Scott French: [C:03+1] "Thanks, Reuven!" [puppet] - 10https://gerrit.wikimedia.org/r/1034633 (https://phabricator.wikimedia.org/T368966) (owner: 10RLazarus)
[16:20:49] <wikibugs>	 (03CR) 10Dzahn: admin: add approvers to group analytics-research-admins [puppet] - 10https://gerrit.wikimedia.org/r/1049239 (https://phabricator.wikimedia.org/T276465) (owner: 10Dzahn)
[16:20:56] <wikibugs>	 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: add approvers to analytics-research-admins - https://phabricator.wikimedia.org/T368435#9945722 (10Dzahn)
[16:21:13] <wikibugs>	 (03CR) 10Dzahn: [C:04-1] "@volans thanks for handling the access requests so nicely this week. would you mind taking a look at this one too?" [puppet] - 10https://gerrit.wikimedia.org/r/1049239 (https://phabricator.wikimedia.org/T276465) (owner: 10Dzahn)
[16:22:55] <wikibugs>	 (03CR) 10Dzahn: [C:04-1] "The thing here is that the owners of the machine are data-engineering but research team uses them. So based on that, who should be the act" [puppet] - 10https://gerrit.wikimedia.org/r/1049239 (https://phabricator.wikimedia.org/T276465) (owner: 10Dzahn)
[16:27:11] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] Toolforge elasticsearch haproxy: update CORS syntax for modern haproxy [puppet] - 10https://gerrit.wikimedia.org/r/1051216 (https://phabricator.wikimedia.org/T311905) (owner: 10Andrew Bogott)
[16:27:11] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:27:29] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1051370 (https://phabricator.wikimedia.org/T368645) (owner: 10Ssingh)
[16:27:58] <wikibugs>	 (03CR) 10Ssingh: "Thanks for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/1051370 (https://phabricator.wikimedia.org/T368645) (owner: 10Ssingh)
[16:28:01] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] varnish: make donate.m redirect permanent and add tests [puppet] - 10https://gerrit.wikimedia.org/r/1051370 (https://phabricator.wikimedia.org/T368645) (owner: 10Ssingh)
[16:28:01] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:32:14] <wikibugs>	 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9945806 (10CDanis) Ah okay sorry.  Maybe experiment with running `rados bench` and slowly increasing the number of n...
[16:34:49] <wikibugs>	 (03CR) 10Vgutierrez: [C:04-1] varnish: Copy value of X-Wikimedia-Debug cookie to header (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1030591 (https://phabricator.wikimedia.org/T350094) (owner: 10Gergő Tisza)
[16:38:28] <wikibugs>	 (03PS4) 10Gergő Tisza: varnish: Copy value of X-Wikimedia-Debug cookie to header [puppet] - 10https://gerrit.wikimedia.org/r/1030591 (https://phabricator.wikimedia.org/T350094)
[16:38:38] <wikibugs>	 06SRE, 06serviceops, 10Data Products (Data Products Sprint 15), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9945824 (10xcollazo) @Scott_French : One odd thing I notice is that, even thou...
[16:38:50] <wikibugs>	 (03CR) 10Gergő Tisza: varnish: Copy value of X-Wikimedia-Debug cookie to header (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1030591 (https://phabricator.wikimedia.org/T350094) (owner: 10Gergő Tisza)
[16:44:10] <wikibugs>	 (03PS1) 10Jforrester: Update OOUI to v0.50.3 [vendor] (wmf/1.43.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1051416
[16:44:26] <wikibugs>	 (03PS1) 10Jforrester: Update OOUI to v0.50.3 [core] (wmf/1.43.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1051417 (https://phabricator.wikimedia.org/T369010)
[16:45:09] <James_F>	 Okie-dokie, train-blocker ahoy.
[16:45:14] <James_F>	 jouncebot: nowandnext
[16:45:14] <jouncebot>	 For the next 0 hour(s) and 14 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240702T1600)
[16:45:14] <jouncebot>	 In 0 hour(s) and 14 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240702T1700)
[16:45:26] <James_F>	 Hmm. Let's see if we can land this swiftly.
[16:46:11] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1002 using scap backport" [vendor] (wmf/1.43.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1051416 (owner: 10Jforrester)
[16:46:11] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1002 using scap backport" [core] (wmf/1.43.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1051417 (https://phabricator.wikimedia.org/T369010) (owner: 10Jforrester)
[16:48:06] <wikibugs>	 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9945840 (10dcaro) >>! In T348643#9945806, @CDanis wrote: > Ah okay sorry.  Maybe experiment with running `rados benc...
[17:00:00] <wikibugs>	 (03PS1) 10DDesouza: miscweb(design-strategy): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051420 (https://phabricator.wikimedia.org/T344471)
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240702T1700)
[17:00:55] <wikibugs>	 (03CR) 10Hashar: [C:03+1] "Feel free to have this deployed at anytime ;)" [core] (wmf/1.43.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1051417 (https://phabricator.wikimedia.org/T369010) (owner: 10Jforrester)
[17:01:08] <wikibugs>	 06SRE, 06serviceops, 10Data Products (Data Products Sprint 15), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9945867 (10Scott_French) Thanks for taking a look, @xcollazo. I'll defer to @m...
[17:01:33] <wikibugs>	 (03CR) 10DDesouza: [C:03+2] miscweb(design-strategy): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051420 (https://phabricator.wikimedia.org/T344471) (owner: 10DDesouza)
[17:02:11] <wikibugs>	 06SRE, 10Observability-Metrics: statsd-exporter in k8s is not configured to use its mapping configuration - https://phabricator.wikimedia.org/T369080 (10colewhite) 03NEW
[17:02:28] <wikibugs>	 06SRE, 10Observability-Metrics: statsd-exporter in k8s is not configured to use its mapping configuration - https://phabricator.wikimedia.org/T369080#9945883 (10colewhite) p:05Triage→03High
[17:02:33] <wikibugs>	 (03PS11) 10Jdlrobson: [July 8th] Reduce list of exclusions for dark mode (1.43.0-wmf.12) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050671 (https://phabricator.wikimedia.org/T366366)
[17:02:39] <wikibugs>	 (03Merged) 10jenkins-bot: miscweb(design-strategy): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051420 (https://phabricator.wikimedia.org/T344471) (owner: 10DDesouza)
[17:06:25] <mutante>	 !log lists1004 - sudo systemctl start wmf_auto_restart_exim4 (T369017)
[17:06:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:06:27] <stashbot>	 T369017: SystemdUnitFailed - lists1004 - wmf_auto_restart_exim4 - https://phabricator.wikimedia.org/T369017
[17:06:33] <logmsgbot>	 !log dani@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply
[17:06:50] <logmsgbot>	 !log dani@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[17:06:51] <logmsgbot>	 !log dani@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[17:07:11] <logmsgbot>	 !log dani@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[17:07:12] <logmsgbot>	 !log dani@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply
[17:07:39] <logmsgbot>	 !log dani@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[17:09:44] <wikibugs>	 (03Merged) 10jenkins-bot: Update OOUI to v0.50.3 [vendor] (wmf/1.43.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1051416 (owner: 10Jforrester)
[17:09:47] <wikibugs>	 (03Merged) 10jenkins-bot: Update OOUI to v0.50.3 [core] (wmf/1.43.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1051417 (https://phabricator.wikimedia.org/T369010) (owner: 10Jforrester)
[17:10:21] <logmsgbot>	 !log jforrester@deploy1002 Started scap sync-world: Backport for [[gerrit:1051416|Update OOUI to v0.50.3]], [[gerrit:1051417|Update OOUI to v0.50.3 (T369010)]]
[17:10:24] <stashbot>	 T369010: Language dropdown on Special:NewItem is broken on Beta Wikidata - https://phabricator.wikimedia.org/T369010
[17:11:51] <wikibugs>	 (03CR) 10Vgutierrez: [C:04-1] varnish: Copy value of X-Wikimedia-Debug cookie to header (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1030591 (https://phabricator.wikimedia.org/T350094) (owner: 10Gergő Tisza)
[17:14:37] <logmsgbot>	 !log jforrester@deploy1002 jforrester: Backport for [[gerrit:1051416|Update OOUI to v0.50.3]], [[gerrit:1051417|Update OOUI to v0.50.3 (T369010)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[17:15:13] <logmsgbot>	 !log jforrester@deploy1002 jforrester: Continuing with sync
[17:17:49] <wikibugs>	 (03PS1) 10Cwhite: admin: add new ssh key for cwhite [puppet] - 10https://gerrit.wikimedia.org/r/1051421
[17:19:02] <wikibugs>	 (03PS1) 10Cwhite: admin: remove old ssh key for cwhite [puppet] - 10https://gerrit.wikimedia.org/r/1051422
[17:20:27] <logmsgbot>	 !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:1051416|Update OOUI to v0.50.3]], [[gerrit:1051417|Update OOUI to v0.50.3 (T369010)]] (duration: 10m 06s)
[17:20:30] <stashbot>	 T369010: Language dropdown on Special:NewItem is broken on Beta Wikidata - https://phabricator.wikimedia.org/T369010
[17:22:42] <wikibugs>	 (03CR) 10Bking: "I think it's just hitting timeouts. If you go back in Grafana (https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=p" [puppet] - 10https://gerrit.wikimedia.org/r/1051369 (https://phabricator.wikimedia.org/T366405) (owner: 10Bking)
[17:24:26] <wikibugs>	 (03CR) 10Bking: dse-k8s-services: Add net-new chart for Airflow (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking)
[17:34:33] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply
[17:34:38] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply
[17:36:19] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply
[17:36:21] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply
[17:38:12] <wikibugs>	 (03PS60) 10Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001)
[17:39:21] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply
[17:39:23] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply
[17:40:25] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply
[17:40:42] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply
[17:42:32] <wikibugs>	 (03PS1) 10Cwhite: logstash: add curator delete job for ecs-k8s indices [puppet] - 10https://gerrit.wikimedia.org/r/1051427 (https://phabricator.wikimedia.org/T368186)
[17:51:56] <wikibugs>	 (03CR) 10Herron: [C:03+1] logstash: route thumbor logs in routing filter [puppet] - 10https://gerrit.wikimedia.org/r/1051214 (https://phabricator.wikimedia.org/T368180) (owner: 10Cwhite)
[17:52:18] <wikibugs>	 (03CR) 10Dzahn: "For now just let me add this: I can help with solving the "route alerts per team". It's possible. We have done this for gerrit checks by c" [puppet] - 10https://gerrit.wikimedia.org/r/1051369 (https://phabricator.wikimedia.org/T366405) (owner: 10Bking)
[17:52:47] <wikibugs>	 (03CR) 10Herron: [C:03+1] logstash: add curator delete job for ecs-k8s indices [puppet] - 10https://gerrit.wikimedia.org/r/1051427 (https://phabricator.wikimedia.org/T368186) (owner: 10Cwhite)
[17:57:18] <wikibugs>	 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data Products (Data Products Sprint 15), and 2 others: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#9946355 (10xcollazo) >>! In T368098#9944101, @ABran-WMF wrote: > [[ https://wm...
[17:59:43] <wikibugs>	 (03CR) 10Krinkle: varnish: Copy value of X-Wikimedia-Debug cookie to header (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1030591 (https://phabricator.wikimedia.org/T350094) (owner: 10Gergő Tisza)
[18:00:05] <jouncebot>	 hashar and jeena: Deploy window MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240702T1800)
[18:08:30] <jinxer-wm>	 FIRING: ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2007:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:10:50] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: base/statsd: add 1.0.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051428
[18:10:50] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: statsd: re-add default args [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051429
[18:10:51] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: statsd-exporter: update modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051430
[18:11:22] <wikibugs>	 (03CR) 10Jdlrobson: [C:04-1] "Note: we want to re-evaluate tier 1 and 2 before deploying this." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050083 (https://phabricator.wikimedia.org/T367150) (owner: 10Jdlrobson)
[18:13:30] <jinxer-wm>	 RESOLVED: ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2007:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:15:34] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, July 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1048138 (https://phabricator.wikimedia.org/T343292) (owner: 10Arlolra)
[18:15:57] <wikibugs>	 06SRE, 10Observability-Metrics: statsd-exporter in k8s is not configured to use its mapping configuration - https://phabricator.wikimedia.org/T369080#9946469 (10Joe)  https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1051428 and followups should fix the issue
[18:16:42] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, July 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1048144 (https://phabricator.wikimedia.org/T363720) (owner: 10Arlolra)
[18:25:52] <wikibugs>	 (03CR) 10Vgutierrez: [C:04-1] varnish: Copy value of X-Wikimedia-Debug cookie to header (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1030591 (https://phabricator.wikimedia.org/T350094) (owner: 10Gergő Tisza)
[18:28:38] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Get test host connected to codfw row c/d lsw's - https://phabricator.wikimedia.org/T367512#9946538 (10cmooney) @Jhancock.wm can you confirm what position in the rack the server is in?  I assumed based on the first port it's in U45 so I...
[18:38:51] <wikibugs>	 (03PS1) 10Ahmon Dancy: DevServices.php: Add excimer-ui-url/excimer-ui-server placeholders [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1051431
[18:41:13] <wikibugs>	 (03CR) 10Ahmon Dancy: [C:03+2] DevServices.php: Add excimer-ui-url/excimer-ui-server placeholders [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1051431 (owner: 10Ahmon Dancy)
[18:41:51] <wikibugs>	 (03Merged) 10jenkins-bot: DevServices.php: Add excimer-ui-url/excimer-ui-server placeholders [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1051431 (owner: 10Ahmon Dancy)
[18:49:04] <wikibugs>	 (03PS1) 10JHathaway: postfix: add wikifunctions.org [puppet] - 10https://gerrit.wikimedia.org/r/1051432 (https://phabricator.wikimedia.org/T325406)
[18:50:09] <wikibugs>	 (03PS1) 10JHathaway: postfix: fix use param of $trusted_networks [puppet] - 10https://gerrit.wikimedia.org/r/1051433 (https://phabricator.wikimedia.org/T325406)
[18:51:17] <wikibugs>	 (03PS1) 10JHathaway: postfix: override default for parent_domain_matches_subdomains [puppet] - 10https://gerrit.wikimedia.org/r/1051434 (https://phabricator.wikimedia.org/T325406)
[18:51:20] <wikibugs>	 (03PS1) 10JHathaway: postfix: verify recipients when possible [puppet] - 10https://gerrit.wikimedia.org/r/1051435 (https://phabricator.wikimedia.org/T325406)
[18:51:43] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1051432 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway)
[18:52:33] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1051433 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway)
[18:52:36] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1051434 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway)
[18:52:38] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1051435 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway)
[18:54:39] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] postfix: add wikifunctions.org [puppet] - 10https://gerrit.wikimedia.org/r/1051432 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway)
[18:54:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T364069)', diff saved to https://phabricator.wikimedia.org/P65671 and previous config saved to /var/cache/conftool/dbconfig/20240702-185443-marostegui.json
[18:54:46] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[18:55:00] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] postfix: fix use param of $trusted_networks [puppet] - 10https://gerrit.wikimedia.org/r/1051433 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway)
[19:07:17] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] postfix: override default for parent_domain_matches_subdomains [puppet] - 10https://gerrit.wikimedia.org/r/1051434 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway)
[19:07:24] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] postfix: verify recipients when possible [puppet] - 10https://gerrit.wikimedia.org/r/1051435 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway)
[19:08:01] <wikibugs>	 (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/1051439
[19:09:02] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Add new elements to automation to support new switches in codfw C/D - https://phabricator.wikimedia.org/T369106 (10cmooney) 03NEW p:05Triage→03Medium
[19:09:51] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P65672 and previous config saved to /var/cache/conftool/dbconfig/20240702-190950-marostegui.json
[19:10:45] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Add new elements to automation to support new switches in codfw C/D - https://phabricator.wikimedia.org/T369106#9946735 (10cmooney)
[19:10:46] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Codfw row C/D switch installation & configuration - https://phabricator.wikimedia.org/T364095#9946736 (10cmooney)
[19:10:51] <wikibugs>	 (03PS1) 10Cathal Mooney: Add net data for codfw new per-rack subnets and add switches to rancid [puppet] - 10https://gerrit.wikimedia.org/r/1051440 (https://phabricator.wikimedia.org/T369106)
[19:16:27] <wikibugs>	 (03PS2) 10Cathal Mooney: Add net data for codfw new per-rack subnets and add switches to rancid [puppet] - 10https://gerrit.wikimedia.org/r/1051440 (https://phabricator.wikimedia.org/T369106)
[19:19:15] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[19:21:08] <wikibugs>	 (03PS5) 10Gergő Tisza: varnish: Copy value of X-Wikimedia-Debug cookie to header [puppet] - 10https://gerrit.wikimedia.org/r/1030591 (https://phabricator.wikimedia.org/T350094)
[19:24:32] <wikibugs>	 (03PS6) 10Gergő Tisza: varnish: Copy value of X-Wikimedia-Debug cookie to header [puppet] - 10https://gerrit.wikimedia.org/r/1030591 (https://phabricator.wikimedia.org/T350094)
[19:24:58] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P65673 and previous config saved to /var/cache/conftool/dbconfig/20240702-192457-marostegui.json
[19:25:05] <wikibugs>	 (03CR) 10Gergő Tisza: varnish: Copy value of X-Wikimedia-Debug cookie to header (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1030591 (https://phabricator.wikimedia.org/T350094) (owner: 10Gergő Tisza)
[19:27:33] <wikibugs>	 (03CR) 10Dduvall: [C:03+1] "Looks right!" [puppet] - 10https://gerrit.wikimedia.org/r/1051208 (owner: 10Ahmon Dancy)
[19:35:38] <wikibugs>	 (03PS1) 10Andrew Bogott: profile::toolforge::elasticsearch::keepalived: keepalived interface from hiera [puppet] - 10https://gerrit.wikimedia.org/r/1051444 (https://phabricator.wikimedia.org/T311905)
[19:36:47] <wikibugs>	 (03PS1) 10JHathaway: temporarily add mx-in1001 as an MX server, test #2 [dns] - 10https://gerrit.wikimedia.org/r/1051445 (https://phabricator.wikimedia.org/T367517)
[19:36:56] <wikibugs>	 06SRE, 06serviceops, 10Data Products (Data Products Sprint 15), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9946877 (10mforns) @Scott_French Would it be possible for us to make a last ho...
[19:37:37] <wikibugs>	 (03CR) 10CI reject: [V:04-1] temporarily add mx-in1001 as an MX server, test #2 [dns] - 10https://gerrit.wikimedia.org/r/1051445 (https://phabricator.wikimedia.org/T367517) (owner: 10JHathaway)
[19:40:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T364069)', diff saved to https://phabricator.wikimedia.org/P65674 and previous config saved to /var/cache/conftool/dbconfig/20240702-194005-marostegui.json
[19:40:07] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2188.codfw.wmnet with reason: Maintenance
[19:40:08] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[19:40:11] <wikibugs>	 (03PS2) 10JHathaway: temporarily add mx-in1001 as an MX server, test #2 [dns] - 10https://gerrit.wikimedia.org/r/1051445 (https://phabricator.wikimedia.org/T367517)
[19:40:21] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2188.codfw.wmnet with reason: Maintenance
[19:40:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2188 (T364069)', diff saved to https://phabricator.wikimedia.org/P65675 and previous config saved to /var/cache/conftool/dbconfig/20240702-194027-marostegui.json
[19:41:26] <wikibugs>	 (03PS5) 10Herron: pyrra: add liftwing SLOs [puppet] - 10https://gerrit.wikimedia.org/r/1051439 (https://phabricator.wikimedia.org/T302995)
[19:41:27] <wikibugs>	 (03CR) 10Herron: [V:03+1] "Hey Luca, thinking about revisiting this to see how it performs now.  What do you think?" [puppet] - 10https://gerrit.wikimedia.org/r/1051439 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron)
[19:41:38] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] profile::toolforge::elasticsearch::keepalived: keepalived interface from hiera [puppet] - 10https://gerrit.wikimedia.org/r/1051444 (https://phabricator.wikimedia.org/T311905) (owner: 10Andrew Bogott)
[19:42:37] <wikibugs>	 (03PS1) 10Ryan Kemper: [WIP] wdqs graph split: new A and DYNA records [dns] - 10https://gerrit.wikimedia.org/r/1051446 (https://phabricator.wikimedia.org/T364364)
[19:43:49] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] temporarily add mx-in1001 as an MX server, test #2 [dns] - 10https://gerrit.wikimedia.org/r/1051445 (https://phabricator.wikimedia.org/T367517) (owner: 10JHathaway)
[19:45:31] <jhathaway>	 !log running another email inbound mx test on mx-in1001
[19:45:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:56:47] <wikibugs>	 (03PS1) 10Btullis: cephcsi: bump the image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051447 (https://phabricator.wikimedia.org/T327259)
[19:59:56] <wikibugs>	 06SRE, 06serviceops, 10Data Products (Data Products Sprint 15), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9947026 (10Scott_French) @mforns sure, that's no problem at all! Just let me k...
[20:00:04] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240702T2000).
[20:00:04] <jouncebot>	 kimberly_sarabia and arlolra: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:27] <kimberly_sarabia>	 hello!
[20:00:33] <urbanecm>	 i can deploy today
[20:00:35] <urbanecm>	 hello kimberly_sarabia 
[20:01:09] <urbanecm>	 arlolra: around?
[20:01:36] <wikibugs>	 (03CR) 10Urbanecm: [C:03+2] [July 2nd] Mobile: Enable dark mode for all users for tier 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050085 (https://phabricator.wikimedia.org/T367151) (owner: 10Jdlrobson)
[20:02:13] <wikibugs>	 (03Merged) 10jenkins-bot: [July 2nd] Mobile: Enable dark mode for all users for tier 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050085 (https://phabricator.wikimedia.org/T367151) (owner: 10Jdlrobson)
[20:02:35] <arlolra>	 urbanecm: yes, around
[20:03:01] <wikibugs>	 (03PS2) 10Arlolra: Remove unused Linter configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1048138 (https://phabricator.wikimedia.org/T343292)
[20:03:04] <wikibugs>	 (03CR) 10Urbanecm: [C:03+2] Remove unused Linter configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1048138 (https://phabricator.wikimedia.org/T343292) (owner: 10Arlolra)
[20:03:06] <urbanecm>	 yay! :)
[20:03:43] <wikibugs>	 (03Merged) 10jenkins-bot: Remove unused Linter configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1048138 (https://phabricator.wikimedia.org/T343292) (owner: 10Arlolra)
[20:04:16] <wikibugs>	 (03CR) 10Btullis: [C:03+2] cephcsi: bump the image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051447 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis)
[20:04:34] <logmsgbot>	 !log urbanecm@deploy1002 Started scap sync-world: Backport for [[gerrit:1050085|[July 2nd] Mobile: Enable dark mode for all users for tier 1 wikis (T367151)]], [[gerrit:1048138|Remove unused Linter configs (T343292)]]
[20:04:39] <stashbot>	 T367151: [Config] Deploy dark mode to all users in tier 1 wikis on the Minerva skin - https://phabricator.wikimedia.org/T367151
[20:04:41] <stashbot>	 T343292: Deprecate and then remove Linter config variables used to control new linter table field access - https://phabricator.wikimedia.org/T343292
[20:07:19] <wikibugs>	 (03Merged) 10jenkins-bot: cephcsi: bump the image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051447 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis)
[20:07:21] <logmsgbot>	 !log urbanecm@deploy1002 jdlrobson, arlolra, urbanecm: Backport for [[gerrit:1050085|[July 2nd] Mobile: Enable dark mode for all users for tier 1 wikis (T367151)]], [[gerrit:1048138|Remove unused Linter configs (T343292)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:07:29] <urbanecm>	 kimberly_sarabia: please test at mwdebug
[20:07:41] <urbanecm>	 arlolra: your first patch is at debug as well, but looks like there might not be anything to test?
[20:07:55] <kimberly_sarabia>	 ok i need a couple minutes. have to look at several wikis
[20:09:13] <urbanecm>	 sure
[20:09:15] <arlolra>	 urbanecm: thanks, I'll just verify linting is still working
[20:09:22] <urbanecm>	 arlolra: ack, will wait on you
[20:11:21] <wikibugs>	 (03PS1) 10CDanis: copy patch [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051449 (https://phabricator.wikimedia.org/T363407)
[20:11:23] <wikibugs>	 (03PS1) 10CDanis: mesh: use namespace for default service name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051450 (https://phabricator.wikimedia.org/T363407)
[20:11:48] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] Add net data for codfw new per-rack subnets and add switches to rancid [puppet] - 10https://gerrit.wikimedia.org/r/1051440 (https://phabricator.wikimedia.org/T369106) (owner: 10Cathal Mooney)
[20:12:18] <kimberly_sarabia>	 urbanecm: LGTM!
[20:12:22] <wikibugs>	 (03PS2) 10CDanis: mesh: use namespace for default service name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051450 (https://phabricator.wikimedia.org/T363407)
[20:12:23] <urbanecm>	 ack, ty!
[20:12:25] <urbanecm>	 waiting on arlolra 
[20:14:28] <arlolra>	 Please go ahead
[20:15:22] <logmsgbot>	 !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[20:15:46] <logmsgbot>	 !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[20:15:59] <urbanecm>	 proceeding
[20:16:01] <logmsgbot>	 !log urbanecm@deploy1002 jdlrobson, arlolra, urbanecm: Continuing with sync
[20:16:06] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Add new elements to automation to support new switches in codfw C/D - https://phabricator.wikimedia.org/T369106#9947129 (10cmooney)
[20:16:36] <wikibugs>	 (03PS2) 10Arlolra: Follow the defaults for Parsoid on MFE on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1048144 (https://phabricator.wikimedia.org/T363720)
[20:16:40] <wikibugs>	 (03CR) 10Urbanecm: [C:03+2] Follow the defaults for Parsoid on MFE on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1048144 (https://phabricator.wikimedia.org/T363720) (owner: 10Arlolra)
[20:17:14] <wikibugs>	 (03Merged) 10jenkins-bot: Follow the defaults for Parsoid on MFE on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1048144 (https://phabricator.wikimedia.org/T363720) (owner: 10Arlolra)
[20:21:06] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1050085|[July 2nd] Mobile: Enable dark mode for all users for tier 1 wikis (T367151)]], [[gerrit:1048138|Remove unused Linter configs (T343292)]] (duration: 16m 31s)
[20:21:11] <stashbot>	 T367151: [Config] Deploy dark mode to all users in tier 1 wikis on the Minerva skin - https://phabricator.wikimedia.org/T367151
[20:21:12] <stashbot>	 T343292: Deprecate and then remove Linter config variables used to control new linter table field access - https://phabricator.wikimedia.org/T343292
[20:21:20] <urbanecm>	 first one done
[20:21:47] <logmsgbot>	 !log urbanecm@deploy1002 Started scap sync-world: Backport for [[gerrit:1048144|Follow the defaults for Parsoid on MFE on officewiki (T363720)]]
[20:21:50] <stashbot>	 T363720: Provide ParserMigration option to exclude mobile frontend  - https://phabricator.wikimedia.org/T363720
[20:24:34] <logmsgbot>	 !log urbanecm@deploy1002 arlolra, urbanecm: Backport for [[gerrit:1048144|Follow the defaults for Parsoid on MFE on officewiki (T363720)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:25:07] <logmsgbot>	 !log cmooney@cumin1002 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host wikikube-ctrl2002.codfw.wmnet
[20:25:52] <urbanecm>	 arlolra: please take a look at the second patch
[20:26:00] <arlolra>	 Will do
[20:27:48] <arlolra>	 Ok, working as expected
[20:28:16] <logmsgbot>	 !log urbanecm@deploy1002 arlolra, urbanecm: Continuing with sync
[20:28:19] <urbanecm>	 proceeding, thanks
[20:30:15] <wikibugs>	 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data Products (Data Products Sprint 15), and 2 others: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#9947164 (10xcollazo) `20240701` run update:  Most all wikis are now done with...
[20:31:53] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[20:33:32] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1048144|Follow the defaults for Parsoid on MFE on officewiki (T363720)]] (duration: 11m 44s)
[20:33:34] <stashbot>	 T363720: Provide ParserMigration option to exclude mobile frontend  - https://phabricator.wikimedia.org/T363720
[20:33:44] <urbanecm>	 arlolra: and, done
[20:33:46] <urbanecm>	 anything else?
[20:33:54] <kimberly_sarabia>	 All good. Thanks so much
[20:33:58] <urbanecm>	 any time!
[20:34:18] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add sretest2002 entries - cmooney@cumin1002"
[20:35:12] <wikibugs>	 (03PS1) 10JHathaway: Revert "temporarily add mx-in1001 as an MX server, test #2" [dns] - 10https://gerrit.wikimedia.org/r/1051452
[20:35:20] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add sretest2002 entries - cmooney@cumin1002"
[20:35:20] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:36:50] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] Revert "temporarily add mx-in1001 as an MX server, test #2" [dns] - 10https://gerrit.wikimedia.org/r/1051452 (owner: 10JHathaway)
[20:37:56] <wikibugs>	 (03PS1) 10CDanis: Bump mediawiki chart version & mesh version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051453 (https://phabricator.wikimedia.org/T363407)
[20:39:09] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.provision for host sretest2002.mgmt.codfw.wmnet with reboot policy FORCED
[20:39:40] <wikibugs>	 (03PS1) 10Ahmon Dancy: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1051454
[20:40:38] <arlolra>	 thanks urbanecm 
[20:40:45] <wikibugs>	 (03CR) 10Ahmon Dancy: [C:03+2] Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1051454 (owner: 10Ahmon Dancy)
[20:41:23] <wikibugs>	 (03Merged) 10jenkins-bot: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1051454 (owner: 10Ahmon Dancy)
[20:42:37] <wikibugs>	 (03PS1) 10Ahmon Dancy: DevServices.php: Set ipoid placeholder [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1051456
[20:42:52] <wikibugs>	 (03CR) 10Ahmon Dancy: [C:03+2] DevServices.php: Set ipoid placeholder [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1051456 (owner: 10Ahmon Dancy)
[20:43:34] <wikibugs>	 (03Merged) 10jenkins-bot: DevServices.php: Set ipoid placeholder [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1051456 (owner: 10Ahmon Dancy)
[20:44:15] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[20:45:49] <logmsgbot>	 !log cmooney@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2002.mgmt.codfw.wmnet with reboot policy FORCED
[20:45:52] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[20:48:48] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Get test host connected to codfw row c/d lsw's - https://phabricator.wikimedia.org/T367512#9947208 (10cmooney) Also @Jhancock.wm when next on site can you check the mgmt / idrac connection for this one?  It doesn't seem to be trying to...
[20:49:27] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add sretest2002 entries - cmooney@cumin1002"
[20:50:15] <icinga-wm>	 PROBLEM - Postfix SMTP on mx-in1001 is CRITICAL: connect to address 208.80.155.102 and port 25: Connection refused https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting
[20:50:29] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add sretest2002 entries - cmooney@cumin1002"
[20:50:29] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:52:11] <wikibugs>	 (03PS6) 10Ayounsi: DHCP: send subnet-mask 255.255.255.255 for routed ganeti VMs [puppet] - 10https://gerrit.wikimedia.org/r/1051366 (https://phabricator.wikimedia.org/T300152)
[20:52:11] <wikibugs>	 (03PS1) 10Ayounsi: Routed Ganeti: add public v4 tap_ip [puppet] - 10https://gerrit.wikimedia.org/r/1051458 (https://phabricator.wikimedia.org/T362330)
[20:52:26] <wikibugs>	 (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1051458 (https://phabricator.wikimedia.org/T362330) (owner: 10Ayounsi)
[20:53:15] <icinga-wm>	 RECOVERY - Postfix SMTP on mx-in1001 is OK: OK - Certificate mx-in1001.wikimedia.org will expire on Wed 11 Sep 2024 07:47:45 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting
[20:59:02] <wikibugs>	 (03PS2) 10RLazarus: mediawiki: Allow setting `tty` and `stdin` for Jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051196 (https://phabricator.wikimedia.org/T368966)
[20:59:20] <wikibugs>	 (03PS4) 10RLazarus: deployment_server: Add --follow, --attach to mwscript_k8s [puppet] - 10https://gerrit.wikimedia.org/r/1034633 (https://phabricator.wikimedia.org/T368966)
[21:00:39] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Mail: Postfix inbound rollout sequence, mx-in - https://phabricator.wikimedia.org/T367517#9947237 (10jhathaway)
[21:01:13] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail: Postfix outbound rollout sequence, mx-out - https://phabricator.wikimedia.org/T365395#9947239 (10jhathaway)
[21:01:24] <wikibugs>	 (03CR) 10RLazarus: [C:03+2] mediawiki: Allow setting `tty` and `stdin` for Jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051196 (https://phabricator.wikimedia.org/T368966) (owner: 10RLazarus)
[21:03:11] <wikibugs>	 (03Merged) 10jenkins-bot: mediawiki: Allow setting `tty` and `stdin` for Jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051196 (https://phabricator.wikimedia.org/T368966) (owner: 10RLazarus)
[21:03:42] <wikibugs>	 (03CR) 10RLazarus: [C:03+2] "Thanks for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/1034633 (https://phabricator.wikimedia.org/T368966) (owner: 10RLazarus)
[21:06:44] <wikibugs>	 06SRE, 06collaboration-services, 06DBA, 13Patch-For-Review: Update grants for mailman - https://phabricator.wikimedia.org/T367833#9947250 (10eoghan) a:05eoghan→03Ladsgroup Spoken with @Ladsgroup , I think there's nothing immediate for sre-collab to do here so reassigning. Feel free to send it back to m...
[21:10:19] <wikibugs>	 (03CR) 10RLazarus: [C:03+2] base/statsd: add 1.0.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051428 (owner: 10Giuseppe Lavagetto)
[21:11:12] <wikibugs>	 (03Merged) 10jenkins-bot: base/statsd: add 1.0.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051428 (owner: 10Giuseppe Lavagetto)
[21:11:40] <wikibugs>	 (03PS2) 10RLazarus: statsd: re-add default args [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051429 (https://phabricator.wikimedia.org/T369080) (owner: 10Giuseppe Lavagetto)
[21:11:48] <wikibugs>	 (03CR) 10CI reject: [V:04-1] statsd: re-add default args [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051429 (https://phabricator.wikimedia.org/T369080) (owner: 10Giuseppe Lavagetto)
[21:12:45] <wikibugs>	 (03PS3) 10RLazarus: statsd: re-add default args [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051429 (https://phabricator.wikimedia.org/T369080) (owner: 10Giuseppe Lavagetto)
[21:15:17] <wikibugs>	 (03CR) 10RLazarus: [C:03+2] statsd: re-add default args [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051429 (https://phabricator.wikimedia.org/T369080) (owner: 10Giuseppe Lavagetto)
[21:16:06] <wikibugs>	 (03Merged) 10jenkins-bot: statsd: re-add default args [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051429 (https://phabricator.wikimedia.org/T369080) (owner: 10Giuseppe Lavagetto)
[21:17:59] <wikibugs>	 (03CR) 10RLazarus: [C:03+2] statsd-exporter: update modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051430 (owner: 10Giuseppe Lavagetto)
[21:28:05] <wikibugs>	 (03CR) 10RLazarus: statsd-exporter: update modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051430 (owner: 10Giuseppe Lavagetto)
[21:28:09] <wikibugs>	 (03CR) 10RLazarus: [C:03+2] statsd-exporter: update modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051430 (owner: 10Giuseppe Lavagetto)
[21:30:40] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PowerSupplyFailure - db1181 - https://phabricator.wikimedia.org/T368648#9947337 (10Jhancock.wm) a:03VRiley-WMF
[21:30:47] <wikibugs>	 (03PS2) 10CDanis: Bump mediawiki chart version & mesh version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051453 (https://phabricator.wikimedia.org/T363407)
[21:30:47] <wikibugs>	 (03PS1) 10CDanis: DO NOT SUBMIT, testing mesh change against mediawiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051466
[21:31:34] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: hw troubleshooting: Faulty 40GBase-LR4 link from cloudsw1-d5-eqiad to cloudsw1-f4-eqiad - https://phabricator.wikimedia.org/T367199#9947342 (10Jhancock.wm) a:03VRiley-WMF
[21:35:53] <icinga-wm>	 PROBLEM - Disk space on restbase2023 is CRITICAL: DISK CRITICAL - free space: /srv/sdb4 113989 MB (6% inode=99%): /srv/sdc4 69102 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase2023&var-datasource=codfw+prometheus/ops
[21:44:02] <rzl>	 jouncebot: nowandnext
[21:44:02] <jouncebot>	 No deployments scheduled for the next 8 hour(s) and 15 minute(s)
[21:44:02] <jouncebot>	 In 8 hour(s) and 15 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240703T0600)
[21:44:38] <rzl>	 doing a quick helmfile-only MW deploy for T369080
[21:44:38] <stashbot>	 T369080: statsd-exporter in k8s is not configured to use its mapping configuration - https://phabricator.wikimedia.org/T369080
[21:51:04] <logmsgbot>	 !log rzl@deploy1002 Started scap sync-world: T369080
[21:51:07] <stashbot>	 T369080: statsd-exporter in k8s is not configured to use its mapping configuration - https://phabricator.wikimedia.org/T369080
[21:52:47] <logmsgbot>	 !log rzl@deploy1002 rzl: T369080 synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:54:11] <logmsgbot>	 !log rzl@deploy1002 rzl: Continuing with sync
[21:54:43] <logmsgbot>	 !log rzl@deploy1002 Finished scap: T369080 (duration: 04m 13s)
[21:55:59] <rzl>	 ah, I missed the recent changes to the statsd-exporter deployment -- I see scap doesn't touch it, deploying it manually with helmfile now
[21:56:22] <rzl>	 just when I finally get used to "never run helmfile across all mw deployments, use scap instead" :)
[21:57:14] <wikibugs>	 (03PS2) 10Wargo: Namespace and import configuration for sysop_plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051467
[21:57:49] <logmsgbot>	 !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[21:58:16] <logmsgbot>	 !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[21:58:17] <logmsgbot>	 !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[21:58:27] <logmsgbot>	 !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[22:01:39] <logmsgbot>	 !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply
[22:01:55] <logmsgbot>	 !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply
[22:01:56] <logmsgbot>	 !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply
[22:02:09] <logmsgbot>	 !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply
[22:02:10] <logmsgbot>	 !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply
[22:02:25] <logmsgbot>	 !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply
[22:02:26] <logmsgbot>	 !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply
[22:02:40] <logmsgbot>	 !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply
[22:02:42] <logmsgbot>	 !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply
[22:02:55] <logmsgbot>	 !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply
[22:02:56] <logmsgbot>	 !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply
[22:03:10] <logmsgbot>	 !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply
[22:03:11] <logmsgbot>	 !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply
[22:03:14] <logmsgbot>	 !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply
[22:03:15] <logmsgbot>	 !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply
[22:03:17] <logmsgbot>	 !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply
[22:03:18] <logmsgbot>	 !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/mw-misc: apply
[22:03:30] <logmsgbot>	 !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-misc: apply
[22:03:31] <logmsgbot>	 !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-misc: apply
[22:03:40] <logmsgbot>	 !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-misc: apply
[22:03:41] <logmsgbot>	 !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply
[22:03:43] <logmsgbot>	 !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[22:03:44] <logmsgbot>	 !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply
[22:03:46] <logmsgbot>	 !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[22:03:48] <logmsgbot>	 !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply
[22:04:08] <logmsgbot>	 !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply
[22:04:09] <logmsgbot>	 !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply
[22:04:21] <logmsgbot>	 !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply
[22:04:24] <logmsgbot>	 !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[22:04:38] <logmsgbot>	 !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[22:04:39] <logmsgbot>	 !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[22:04:50] <logmsgbot>	 !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[22:04:51] <logmsgbot>	 !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/mw-wikifunctions: apply
[22:05:00] <logmsgbot>	 !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-wikifunctions: apply
[22:05:01] <logmsgbot>	 !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-wikifunctions: apply
[22:05:08] <logmsgbot>	 !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-wikifunctions: apply
[22:05:36] <wikibugs>	 (03PS12) 10Jdlrobson: [July 8th] Reduce list of exclusions for dark mode (1.43.0-wmf.12) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050671 (https://phabricator.wikimedia.org/T366366)
[22:05:44] <wikibugs>	 (03PS4) 10Jdlrobson: [July 15th] Deploy dark mode to all logged-in users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050082 (https://phabricator.wikimedia.org/T368795)
[22:05:49] <rzl>	 done deploying
[22:08:41] <wikibugs>	 06SRE, 10Observability-Metrics: statsd-exporter in k8s is not configured to use its mapping configuration - https://phabricator.wikimedia.org/T369080#9947463 (10RLazarus) Disregard the above scap, I got too carried away with "never run helmfile across all mw deployments, use scap instead" but obviously that ru...
[22:13:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T364069)', diff saved to https://phabricator.wikimedia.org/P65676 and previous config saved to /var/cache/conftool/dbconfig/20240702-221312-marostegui.json
[22:13:20] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[22:21:06] <wikibugs>	 (03PS2) 10Wargo: Set logo and favicon for sysop_plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051469 (https://phabricator.wikimedia.org/T368712)
[22:25:21] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 03 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051469 (https://phabricator.wikimedia.org/T368712) (owner: 10Wargo)
[22:25:45] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 03 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051467 (owner: 10Wargo)
[22:28:21] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P65677 and previous config saved to /var/cache/conftool/dbconfig/20240702-222820-marostegui.json
[22:43:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P65678 and previous config saved to /var/cache/conftool/dbconfig/20240702-224328-marostegui.json
[22:58:36] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T364069)', diff saved to https://phabricator.wikimedia.org/P65679 and previous config saved to /var/cache/conftool/dbconfig/20240702-225835-marostegui.json
[22:58:38] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2202.codfw.wmnet with reason: Maintenance
[22:58:39] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[22:58:51] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2202.codfw.wmnet with reason: Maintenance
[23:10:13] <wikibugs>	 06SRE, 10Observability-Metrics: statsd-exporter in k8s is not configured to use its mapping configuration - https://phabricator.wikimedia.org/T369080#9947536 (10colewhite) Thank you @RLazarus!  @dcausse, I see some metrics now at `mediawiki_cirrus_search_request_time_bucket`.  Anything amiss?
[23:19:46] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222 (T367856)', diff saved to https://phabricator.wikimedia.org/P65680 and previous config saved to /var/cache/conftool/dbconfig/20240702-231945-marostegui.json
[23:19:49] <stashbot>	 T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[23:34:53] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P65681 and previous config saved to /var/cache/conftool/dbconfig/20240702-233452-marostegui.json
[23:38:27] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1051486
[23:38:27] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1051486 (owner: 10TrainBranchBot)
[23:50:00] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P65682 and previous config saved to /var/cache/conftool/dbconfig/20240702-234959-marostegui.json
[23:51:45] <wikibugs>	 (03CR) 10Eccenux: "Seems like it needs yaml update too" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051469 (https://phabricator.wikimedia.org/T368712) (owner: 10Wargo)
[23:54:59] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1051427 (https://phabricator.wikimedia.org/T368186) (owner: 10Cwhite)
[23:55:21] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1051214 (https://phabricator.wikimedia.org/T368180) (owner: 10Cwhite)