[00:02:08] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1051218 (owner: TrainBranchBot) [00:13:15] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [00:14:30] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [00:14:32] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1037.eqiad.wmnet with OS bullseye [00:14:39] ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#9942772 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1037.eqiad.wmnet with OS bullseye comp...
[00:14:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P65606 and previous config saved to /var/cache/conftool/dbconfig/20240702-001448-marostegui.json [00:15:08] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [00:16:17] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [00:16:21] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1038.eqiad.wmnet with OS bullseye [00:16:26] ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#9942773 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1038.eqiad.wmnet with OS bullseye comp...
[00:16:44] ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#9942777 (Jclark-ctr) [00:16:56] ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#9942778 (Jclark-ctr) a:Jclark-ctr [00:17:19] ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#9942779 (Jclark-ctr) [00:18:15] ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#9942781 (Jclark-ctr) @VRiley-WMF if you can update with 2nd network connection then hand over to @cmooney [00:21:07] ops-eqiad, SRE, DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T368866#9942784 (Jclark-ctr) Open→Resolved a:Jclark-ctr duplicate of T362033 [00:23:50] ops-eqiad, SRE, Data-Engineering, DC-Ops: Q4:rack/setup/install an-conf100[4-6] - https://phabricator.wikimedia.org/T364429#9942789 (Jclark-ctr) @BTullis if you get a chance to update files.
These are ready to be imaged and handed over [00:27:22] ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870#9942790 (Jclark-ctr) @Andrew @dcaro thank you for providing update did you have host names for this and please update preseed.yaml, and site.pp [00:29:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P65607 and previous config saved to /var/cache/conftool/dbconfig/20240702-002955-marostegui.json [00:32:46] FIRING: [3x] Traffic bill over quota: Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota got better - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [00:45:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T364069)', diff saved to https://phabricator.wikimedia.org/P65608 and previous config saved to /var/cache/conftool/dbconfig/20240702-004502-marostegui.json [00:45:05] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2153.codfw.wmnet with reason: Maintenance [00:45:05] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [00:45:18] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2153.codfw.wmnet with reason: Maintenance [00:45:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2153 (T364069)', diff saved to https://phabricator.wikimedia.org/P65609 and previous config saved to /var/cache/conftool/dbconfig/20240702-004524-marostegui.json [00:45:57] (PS3) RLazarus: deployment_server: Add --follow, --attach to mwscript_k8s [puppet] - https://gerrit.wikimedia.org/r/1034633 (https://phabricator.wikimedia.org/T368966) [00:47:11] (CR) RLazarus: deployment_server: Add --follow, --attach to mwscript_k8s (2 comments) [puppet] -
https://gerrit.wikimedia.org/r/1034633 (https://phabricator.wikimedia.org/T368966) (owner: RLazarus) [00:52:46] RESOLVED: [3x] Traffic bill over quota: Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota got better - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [01:02:48] (PS6) Jdlrobson: [July 2nd] Mobile: Enable dark mode for all users for tier 1 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1050085 (https://phabricator.wikimedia.org/T367151) [01:08:01] (PS1) TrainBranchBot: Branch commit for wmf/1.43.0-wmf.12 [core] (wmf/1.43.0-wmf.12) - https://gerrit.wikimedia.org/r/1051223 (https://phabricator.wikimedia.org/T366957) [01:08:03] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.43.0-wmf.12 [core] (wmf/1.43.0-wmf.12) - https://gerrit.wikimedia.org/r/1051223 (https://phabricator.wikimedia.org/T366957) (owner: TrainBranchBot) [01:18:27] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [01:20:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-wikifunctions (k8s) 1.015s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:23:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 0% idle - https://bit.ly/wmf-fpmsat -
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [01:25:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-wikifunctions (k8s) 1.015s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:29:59] (Merged) jenkins-bot: Branch commit for wmf/1.43.0-wmf.12 [core] (wmf/1.43.0-wmf.12) - https://gerrit.wikimedia.org/r/1051223 (https://phabricator.wikimedia.org/T366957) (owner: TrainBranchBot) [01:46:40] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:50:58] SRE: wikipedia-pl-sysop: local images fail to generate thumbnail - https://phabricator.wikimedia.org/T368945#9942841 (Peachey88) [02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240702T0200) [02:39:16] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:59:16] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable -
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240702T0300) [03:01:54] (PS1) TrainBranchBot: testwikis wikis to 1.43.0-wmf.12 [mediawiki-config] - https://gerrit.wikimedia.org/r/1051231 (https://phabricator.wikimedia.org/T366957) [03:01:56] (CR) TrainBranchBot: [C:+2] testwikis wikis to 1.43.0-wmf.12 [mediawiki-config] - https://gerrit.wikimedia.org/r/1051231 (https://phabricator.wikimedia.org/T366957) (owner: TrainBranchBot) [03:02:36] (Merged) jenkins-bot: testwikis wikis to 1.43.0-wmf.12 [mediawiki-config] - https://gerrit.wikimedia.org/r/1051231 (https://phabricator.wikimedia.org/T366957) (owner: TrainBranchBot) [03:03:06] !log mwpresync@deploy1002 Started scap sync-world: testwikis wikis to 1.43.0-wmf.12 refs T366957 [03:03:09] T366957: 1.43.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T366957 [03:06:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:21:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T367856)', diff saved to https://phabricator.wikimedia.org/P65610 and previous config saved to /var/cache/conftool/dbconfig/20240702-032121-marostegui.json [03:21:25] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [03:27:00] PROBLEM - Hadoop HistoryServer on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [03:36:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P65611 and previous config saved to /var/cache/conftool/dbconfig/20240702-033628-marostegui.json [03:39:00] RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [03:48:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T364069)', diff saved to https://phabricator.wikimedia.org/P65612 and previous config saved to /var/cache/conftool/dbconfig/20240702-034805-marostegui.json [03:48:14] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [03:51:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P65613 and previous config saved to /var/cache/conftool/dbconfig/20240702-035135-marostegui.json [03:54:39] !log mwpresync@deploy1002 Finished scap: testwikis wikis to 1.43.0-wmf.12 refs T366957 (duration: 51m 33s) [03:54:42] T366957: 1.43.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T366957 [04:00:04] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240702T0400) [04:01:06] !log mwpresync@deploy1002 Pruned MediaWiki: 1.43.0-wmf.9 (duration: 01m 02s) [04:03:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P65614 and previous config saved to 
/var/cache/conftool/dbconfig/20240702-040312-marostegui.json [04:06:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T367856)', diff saved to https://phabricator.wikimedia.org/P65615 and previous config saved to /var/cache/conftool/dbconfig/20240702-040643-marostegui.json [04:06:45] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance [04:06:46] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [04:06:58] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance [04:07:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1188 (T367856)', diff saved to https://phabricator.wikimedia.org/P65616 and previous config saved to /var/cache/conftool/dbconfig/20240702-040705-marostegui.json [04:18:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P65617 and previous config saved to /var/cache/conftool/dbconfig/20240702-041819-marostegui.json [04:33:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T364069)', diff saved to https://phabricator.wikimedia.org/P65618 and previous config saved to /var/cache/conftool/dbconfig/20240702-043326-marostegui.json [04:33:29] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2170.codfw.wmnet with reason: Maintenance [04:33:32] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [04:33:42] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2170.codfw.wmnet with reason: Maintenance [04:33:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2170 (T364069)', diff saved to https://phabricator.wikimedia.org/P65619 and previous config saved to 
/var/cache/conftool/dbconfig/20240702-043349-marostegui.json [04:47:20] RECOVERY - Disk space on build2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=build2001&var-datasource=codfw+prometheus/ops [04:57:38] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 3 (deploy1003, ...), Fresh: 141 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [04:58:50] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 33 hosts with reason: Primary switchover s8 T368371 [04:58:53] T368371: Switchover s8 master (db1192 -> db1209) - https://phabricator.wikimedia.org/T368371 [04:58:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db1209 with weight 0 T368371', diff saved to https://phabricator.wikimedia.org/P65620 and previous config saved to /var/cache/conftool/dbconfig/20240702-045856-marostegui.json [04:59:18] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 33 hosts with reason: Primary switchover s8 T368371 [04:59:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db1209 remove from API T368371', diff saved to https://phabricator.wikimedia.org/P65621 and previous config saved to /var/cache/conftool/dbconfig/20240702-045929-marostegui.json [04:59:55] (PS2) Gerrit maintenance bot: mariadb: Promote db1209 to s8 master [puppet] - https://gerrit.wikimedia.org/r/1049478 (https://phabricator.wikimedia.org/T368371) [04:59:57] (PS2) Gerrit maintenance bot: wmnet: Update s8-master alias [dns] - https://gerrit.wikimedia.org/r/1049479 (https://phabricator.wikimedia.org/T368371) [05:00:36] (CR) Marostegui: [C:+2] mariadb: Promote db1209 to s8 master [puppet] - https://gerrit.wikimedia.org/r/1049478 (https://phabricator.wikimedia.org/T368371) (owner: Gerrit maintenance bot) [05:23:55] !log Starting s8 eqiad failover from db1192 to db1209 - T368371 [05:23:58] Logged
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:23:59] T368371: Switchover s8 master (db1192 -> db1209) - https://phabricator.wikimedia.org/T368371 [05:24:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set s8 eqiad as read-only for maintenance - T368371', diff saved to https://phabricator.wikimedia.org/P65622 and previous config saved to /var/cache/conftool/dbconfig/20240702-052408-marostegui.json [05:24:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db1209 to s8 primary and set section read-write T368371', diff saved to https://phabricator.wikimedia.org/P65623 and previous config saved to /var/cache/conftool/dbconfig/20240702-052447-marostegui.json [05:25:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1192 T368371', diff saved to https://phabricator.wikimedia.org/P65624 and previous config saved to /var/cache/conftool/dbconfig/20240702-052543-root.json [05:26:57] (CR) Marostegui: [C:+2] wmnet: Update s8-master alias [dns] - https://gerrit.wikimedia.org/r/1049479 (https://phabricator.wikimedia.org/T368371) (owner: Gerrit maintenance bot) [05:27:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1192 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P65625 and previous config saved to /var/cache/conftool/dbconfig/20240702-052759-root.json [05:28:37] SRE, SRE-swift-storage, Data-Persistence, DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e3-eqiad - https://phabricator.wikimedia.org/T365995#9943050 (Marostegui) [05:43:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1192 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P65626 and previous config saved to /var/cache/conftool/dbconfig/20240702-054304-root.json [05:45:24] (PS1) Giuseppe Lavagetto: Rebuild images to pick up a new version of glogger [docker-images/production-images] - https://gerrit.wikimedia.org/r/1051243
(https://phabricator.wikimedia.org/T368640) [05:47:19] (PS2) Marostegui: orchestrator: Add volans to powerusers [puppet] - https://gerrit.wikimedia.org/r/1051141 [05:47:19] (PS1) Marostegui: filtered_tables.txt: Remove flaggedpage_pending flaggedtemplates [puppet] - https://gerrit.wikimedia.org/r/1051244 (https://phabricator.wikimedia.org/T368939) [05:47:45] (CR) Marostegui: [C:+2] orchestrator: Add volans to powerusers [puppet] - https://gerrit.wikimedia.org/r/1051141 (owner: Marostegui) [05:47:54] (CR) Marostegui: [C:+2] filtered_tables.txt: Remove flaggedpage_pending flaggedtemplates [puppet] - https://gerrit.wikimedia.org/r/1051244 (https://phabricator.wikimedia.org/T368939) (owner: Marostegui) [05:51:08] (PS1) Marostegui: table_jobs.yaml: Remove flaggedpage_pending and flaggedtemplates [puppet] - https://gerrit.wikimedia.org/r/1051245 (https://phabricator.wikimedia.org/T365568) [05:58:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1192 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P65627 and previous config saved to /var/cache/conftool/dbconfig/20240702-055809-root.json [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240702T0600) [06:00:05] marostegui, Amir1, and arnaudb: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240702T0600).
[06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:13:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1192 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P65628 and previous config saved to /var/cache/conftool/dbconfig/20240702-061315-root.json [06:16:08] PROBLEM - Exim SMTP on lists1004 is CRITICAL: connect to address 208.80.154.81 and port 25: Connection refused https://wikitech.wikimedia.org/wiki/Exim [06:18:52] (CR) Giuseppe Lavagetto: [V:+2 C:+2] Rebuild images to pick up a new version of glogger [docker-images/production-images] - https://gerrit.wikimedia.org/r/1051243 (https://phabricator.wikimedia.org/T368640) (owner: Giuseppe Lavagetto) [06:19:12] RECOVERY - Exim SMTP on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 13 Aug 2024 12:55:14 AM GMT +0000.
https://wikitech.wikimedia.org/wiki/Exim [06:20:50] (CR) Filippo Giunchedi: [C:+1] thanos: increase query frontend and store cache sizes [puppet] - https://gerrit.wikimedia.org/r/1051177 (https://phabricator.wikimedia.org/T368953) (owner: Herron) [06:21:16] <_joe_> !log rebuilding httpd-fcgi, mediawiki-httpd images T363342 T368640 [06:21:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:21:20] T363342: glogger crashes regularly in mw-on-k8s containers - https://phabricator.wikimedia.org/T363342 [06:21:21] T368640: glogger produces invalid JSON when given input with non-printable characters - https://phabricator.wikimedia.org/T368640 [06:21:55] (CR) Filippo Giunchedi: [C:+1] prom-ipmi-exporter: add sel-events collector [puppet] - https://gerrit.wikimedia.org/r/1051207 (https://phabricator.wikimedia.org/T368088) (owner: Herron) [06:24:07] <_joe_> jouncebot: now [06:24:07] For the next 0 hour(s) and 35 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240702T0600) [06:24:07] For the next 0 hour(s) and 5 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240702T0600) [06:24:48] <_joe_> marostegui: lmk when you're done, I want to do a null deployment with scap to ensure my new image versions don't mess up something [06:28:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1192 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P65629 and previous config saved to /var/cache/conftool/dbconfig/20240702-062820-root.json [06:31:47] <_joe_> ok I guess I can go on [06:35:04] !log oblivian@deploy1002 Started scap sync-world: Rebuilding images for change to the base image for httpd [06:43:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1192 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P65630 and previous config saved to
/var/cache/conftool/dbconfig/20240702-064326-root.json [06:47:43] (CR) Ladsgroup: [C:+1] table_jobs.yaml: Remove flaggedpage_pending and flaggedtemplates [puppet] - https://gerrit.wikimedia.org/r/1051245 (https://phabricator.wikimedia.org/T365568) (owner: Marostegui) [06:56:41] <_joe_> jouncebot: next [06:56:41] In 0 hour(s) and 3 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240702T0700) [06:57:20] <_joe_> oh nothing in the deployment calendar, so i guess it's not a problem if my full rebuild scap lasts a little later [06:58:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1192 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P65631 and previous config saved to /var/cache/conftool/dbconfig/20240702-065831-root.json [06:59:34] (PS1) Kosta Harlan: Revert "QuickSurveys: Add testing survey configuration" [mediawiki-config] - https://gerrit.wikimedia.org/r/1051246 (https://phabricator.wikimedia.org/T368459) [06:59:44] _joe_: I'm about to add something to the calendar [06:59:54] !log update netboot bookworm image to pickup new point release [06:59:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:02] _joe_: but it is very low priority so could be done later [07:00:04] <_joe_> kostajh: by the time you're done my deployment will be done :) [07:00:05] Amir1 and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240702T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:00:21] <_joe_> kostajh: no please go on [07:00:34] <_joe_> in 2-3 minutes tops my deployment is done [07:01:17] ack [07:01:21] !log oblivian@deploy1002 Finished scap: Rebuilding images for change to the base image for httpd (duration: 26m 52s) [07:01:26] <_joe_> and done :) [07:02:00] (PS2) Kosta Harlan: Revert "QuickSurveys: Add testing survey configuration" [mediawiki-config] - https://gerrit.wikimedia.org/r/1051246 (https://phabricator.wikimedia.org/T368459) [07:03:04] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, July 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1051246 (https://phabricator.wikimedia.org/T368459) (owner: Kosta Harlan) [07:04:07] ok, starting deploy [07:04:53] (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1051246 (https://phabricator.wikimedia.org/T368459) (owner: Kosta Harlan) [07:05:28] (Merged) jenkins-bot: Revert "QuickSurveys: Add testing survey configuration" [mediawiki-config] - https://gerrit.wikimedia.org/r/1051246 (https://phabricator.wikimedia.org/T368459) (owner: Kosta Harlan) [07:06:10] !log kharlan@deploy1002 Started scap sync-world: Backport for [[gerrit:1051246|Revert "QuickSurveys: Add testing survey configuration" (T368459)]] [07:06:13] T368459: Test new QuickSurveys features on testwiki - https://phabricator.wikimedia.org/T368459 [07:06:40] FIRING: SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:14:22] scap seems to be stuck on `Started docker pull on k8s nodes` at 99% [07:16:35] restarting the process [07:16:42] !log kharlan@deploy1002 Started scap sync-world:
Backport for [[gerrit:1051246|Revert "QuickSurveys: Add testing survey configuration" (T368459)]] [07:16:45] T368459: Test new QuickSurveys features on testwiki - https://phabricator.wikimedia.org/T368459 [07:16:54] <_joe_> kostajh: it's not stuck... [07:17:20] _joe_: oh. how could you tell? I restarted it already [07:17:32] there was no output on the 99% stage after 5 minutes [07:18:46] it's at `07:17:49 docker_pull_k8s: 99% (in-flight: 2; ok: 428; fail: 0; left: 0)` again, I'll be more patient this time [07:19:24] <_joe_> kostajh: it will timeout eventually, it's possible there's some nodes down/unresponsive [07:19:31] * _joe_ afk [07:21:23] urbanecm / Amir1 can you advise on what I should do if it times out? Do I need to make a revert of patch and try to sync it, even if the first patch failed to sync? [07:24:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T364069)', diff saved to https://phabricator.wikimedia.org/P65632 and previous config saved to /var/cache/conftool/dbconfig/20240702-072426-marostegui.json [07:24:29] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [07:32:47] kostajh: during the pull k8s stage, nothing, as long as it is not affecting like half of the nodes. [07:33:24] iirc, scap will not even complain about the timeout in a hard way, it'll just continue [07:34:05] source: https://wm-bot.wmcloud.org/browser/index.php?start=06%2F17%2F2024&end=06%2F17%2F2024&display=%23wikimedia-operations (2024-06-17 13:35:34) [07:37:44] FIRING: KubernetesDeploymentUnavailableReplicas: ... [07:37:44] Deployment mw-web.eqiad.main in mw-web at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=mw-web&var-deployment=mw-web.eqiad.main - ... 
[07:37:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [07:39:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P65633 and previous config saved to /var/cache/conftool/dbconfig/20240702-073933-marostegui.json [07:40:56] (CR) Marostegui: [C:+2] table_jobs.yaml: Remove flaggedpage_pending and flaggedtemplates [puppet] - https://gerrit.wikimedia.org/r/1051245 (https://phabricator.wikimedia.org/T365568) (owner: Marostegui) [07:42:22] (CR) JMeybohm: ""minor" 😄 - thanks" [deployment-charts] - https://gerrit.wikimedia.org/r/1051133 (https://phabricator.wikimedia.org/T273507) (owner: JMeybohm) [07:43:38] RECOVERY - Disk space on an-web1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-web1001&var-datasource=eqiad+prometheus/ops [07:47:16] (PS1) Filippo Giunchedi: librenms: use ec certificates only [puppet] - https://gerrit.wikimedia.org/r/1051249 (https://phabricator.wikimedia.org/T369008) [07:51:54] !log kharlan@deploy1002 kharlan: Backport for [[gerrit:1051246|Revert "QuickSurveys: Add testing survey configuration" (T368459)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [07:51:57] T368459: Test new QuickSurveys features on testwiki - https://phabricator.wikimedia.org/T368459 [07:52:49] !log kharlan@deploy1002 kharlan: Continuing with sync [07:54:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P65634 and previous config saved to /var/cache/conftool/dbconfig/20240702-075440-marostegui.json [07:57:44] FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment mw-api-int.eqiad.main in mw-api-int at eqiad has persistently unavailable replicas -
https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [07:58:14] (CR) Fabfur: "with the base64rawurl decoder we can avoid hex" [puppet] - https://gerrit.wikimedia.org/r/1051198 (https://phabricator.wikimedia.org/T365718) (owner: Fabfur) [07:58:28] !log kharlan@deploy1002 Finished scap: Backport for [[gerrit:1051246|Revert "QuickSurveys: Add testing survey configuration" (T368459)]] (duration: 41m 45s) [07:58:30] T368459: Test new QuickSurveys features on testwiki - https://phabricator.wikimedia.org/T368459 [07:59:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T367856)', diff saved to https://phabricator.wikimedia.org/P65635 and previous config saved to /var/cache/conftool/dbconfig/20240702-075904-marostegui.json [07:59:07] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [08:00:04] hashar and jeena: Deploy window MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240702T0800) [08:00:29] urbanecm: hi, you around? could you check how's the maintenance script from yesterday doing? [08:00:33] sure [08:00:46] MatmaRex: it is completed [08:00:50] nice [08:00:57] MatmaRex: do you want the log? [08:01:06] !log cordon kubernetes1051.eqiad.wmnet because of several failed image pulls [08:01:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:29] urbanecm: yeah, if it isn't a big chore, can you drop it on the task?
thank you [08:02:44] FIRING: [4x] KubernetesDeploymentUnavailableReplicas: Deployment mw-api-ext.eqiad.main in mw-api-ext at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [08:02:56] hashar | jeena: can you hold for a minute please? I'd like to double check the backport deploy because ^ [08:03:15] MatmaRex: no problem, published at https://phabricator.wikimedia.org/T356196#9943331 [08:03:25] thanks [08:03:25] backport failed [08:03:39] `backport failed: Command '['/usr/bin/scap', 'sync-world', '--pause-after-testserver-sync', '--notify-user=kharlan', 'Backport for [[gerrit:1051246|Revert "QuickSurveys: Add testing survey configuration" (T368459)]]']' returned non-zero exit status` [08:03:40] T368459: Test new QuickSurveys features on testwiki - https://phabricator.wikimedia.org/T368459 [08:03:49] kostajh: what did it say ...ah [08:04:05] very informative :D [08:04:21] kostajh: there should be a more detailed error message somewhere up [08:04:27] looking [08:04:34] at 07:50 there was `07:50:05 1 K8s nodes failed to pull the multiversion image` [08:04:47] followed by `07:50:05 Finished docker pull on k8s nodes (duration: 32m 40s)` [08:05:21] and at 7:50:05 there was also `07:50:05 /usr/bin/sudo /usr/local/sbin/mediawiki-image-download 2024-07-02-071649-publish (ran as mwdeploy@kubernetes1051.eqiad.wmnet) returned [143]: Pulling 'docker-registry.discovery.wmnet/restricted/mediawiki-multiversion:2024-07-02-071649-publish'...` which ended with `Terminated` [08:05:22] there is one node not behaving properly (kubernetes1051.eqiad.wmnet) ..and not failing properly [08:05:51] I guess that is `ran as mwdeploy@kubernetes1051.eqiad.wmnet` [08:06:12] yes [08:06:50] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_magru [08:07:04] !log 
fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_magru [08:07:19] !log draining kubernetes1051.eqiad.wmnet [08:07:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:22] what (if anything) should I do? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1051246 is now merged but not deployed. [08:08:15] kostajh: I'm not 100% sure as we're now spilling in the train window...cc hashar/jeena [08:08:38] (03PS1) 10KartikMistry: Update MinT to 2024-07-02-060114-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051290 (https://phabricator.wikimedia.org/T364525) [08:09:04] I've taken out kubernetes1051 so if that was the problem (which I suspect) retrying should work in a minute [08:09:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T364069)', diff saved to https://phabricator.wikimedia.org/P65637 and previous config saved to /var/cache/conftool/dbconfig/20240702-080948-marostegui.json [08:09:51] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2173.codfw.wmnet with reason: Maintenance [08:09:51] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [08:10:01] ah, AIUI train is not going to happen because https://phabricator.wikimedia.org/T366957 [08:10:04] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2173.codfw.wmnet with reason: Maintenance [08:10:05] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [08:10:19] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [08:10:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2173 (T364069)', diff saved to https://phabricator.wikimedia.org/P65638 and previous config 
saved to /var/cache/conftool/dbconfig/20240702-081025-marostegui.json [08:10:45] jayme: ok please let me know when I should retry [08:11:03] kostajh: sure, give me 5' [08:11:08] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp2027.*} and A:cp [08:11:23] (03CR) 10JMeybohm: [C:03+2] admin_ng: Switch eqiad und codfw wikikube clusters to PSS [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051133 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm) [08:12:44] RESOLVED: [4x] KubernetesDeploymentUnavailableReplicas: Deployment mw-api-ext.eqiad.main in mw-api-ext at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [08:12:59] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp2027.*} and A:cp [08:13:36] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp2028.*} and A:cp [08:14:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P65639 and previous config saved to /var/cache/conftool/dbconfig/20240702-081411-marostegui.json [08:14:33] (03Merged) 10jenkins-bot: admin_ng: Switch eqiad und codfw wikikube clusters to PSS [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051133 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm) [08:14:51] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp2028.*} and A:cp [08:15:30] (03PS1) 10TrainBranchBot: group0 wikis to 1.43.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051291 (https://phabricator.wikimedia.org/T366957) [08:15:31] (03CR) 10TrainBranchBot: [C:03+2] group0 wikis to 
1.43.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051291 (https://phabricator.wikimedia.org/T366957) (owner: 10TrainBranchBot) [08:15:33] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [08:15:48] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp2030.*} and A:cp [08:15:50] I guess train deployment is happening? [08:16:08] (03Merged) 10jenkins-bot: group0 wikis to 1.43.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051291 (https://phabricator.wikimedia.org/T366957) (owner: 10TrainBranchBot) [08:16:19] kostajh: looks like it :/ [08:16:36] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [08:17:03] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp2030.*} and A:cp [08:17:37] question is who is running it :) [08:19:59] (03PS1) 10Fabfur: hiera: upgrade haproxy to 2.8 on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1051292 (https://phabricator.wikimedia.org/T367756) [08:20:27] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp2031.*} and A:cp [08:20:28] kostajh: I'd say we wait...at least I'm not sure whats supposed to happen rn. Are you okay with re-trying in the afternoon window? Or will the train roll out your change anyways? 
[08:20:40] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1051292 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur) [08:22:16] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp2031.*} and A:cp [08:23:12] docker_pull_k8s: 99% (in-flight: 1; ok: 429; fail: 0; left: 0) [08:23:24] looks like one of the k8s worker is a little slow for some reason ;) [08:23:39] kostajh: jayme: yes I have started the train, sorry I forgot to check here :/ [08:24:02] hashar: we tried to reach out...one of the nodes is borked and will not pull the image [08:24:09] ah ok [08:24:29] my guess is the docker pull made by scap does not have a timeout [08:24:32] but it will also not run mw as it's cordoned now. I'm unsure how scap will handle that though [08:26:34] so we gotta remove it from the dsh group [08:28:12] 08:27:53 docker_pull_k8s: 100% (in-flight: 0; ok: 430; fail: 0; left: 0) [08:28:19] 08:27:53 docker_pull_k8s: 100% (in-flight: 0; ok: 430; fail: 0; left: 0) [08:28:24] someone it managed to pass [08:29:09] good. I've removed all workload from the problematic node, maybe that helped [08:29:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P65640 and previous config saved to /var/cache/conftool/dbconfig/20240702-082918-marostegui.json [08:30:03] jayme: I apologize I should have checked on this channel before starting [08:30:05] I'll set it to inactive anyways. AIUI that should nowadays prevent scap from trying to pull the image there [08:30:07] (03PS1) 10Slyngshede: LDAP key sync: Improvements to SSH key sync with LDAP. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1051293 (https://phabricator.wikimedia.org/T366525) [08:30:07] jayme: I can sync my patch later [08:30:15] I am running the train over a Google Meet with Arnaud this morning, and did not look at IRC :/ [08:30:31] I just don't know if it's problematic to have a patch in mediawiki-config merged that is not actually deployed [08:30:40] !log jayme@cumin1002 conftool action : set/pooled=inactive; selector: name=kubernetes1051.eqiad.wmnet [08:30:52] I have no idea [08:31:06] my guess is that if it is not pulled on the deployment server it is not included in the image [08:32:34] hashar: should I try to sync it again now? [08:33:05] the train is going on [08:33:14] 08:33:09 K8s deployment progress: 67% (ok: 1455; fail: 0; left: 697) | [08:34:13] kostajh: so your patch got merged, I ran `scap train` which sends a patch to mediawiki-config to switch the versions [08:34:15] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_magru [08:34:21] which comes AFTER your patch [08:34:30] and thus I am currently deploying your config change [08:34:37] (as well as switching the group0 wikis) [08:34:45] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.43.0-wmf.12 refs T366957 [08:34:48] T366957: 1.43.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T366957 [08:35:06] kostajh: should be good now [08:35:13] sorry for the screw up :-\\\\ [08:35:55] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 15), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9943442 (10SGupta-WMF) Thank you @scott_french for detailed explanation , I am... 
[08:36:13] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_magru [08:38:11] hashar: ah ok, so I don't need to do anything else? [08:38:19] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp6009.*} and A:cp [08:38:39] (03CR) 10Elukey: [C:03+2] profile::puppetserver::git: add an option to exclude servers [puppet] - 10https://gerrit.wikimedia.org/r/1050601 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey) [08:39:05] kostajh: nop! I have sneakily deployed it! [08:40:50] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp6009.*} and A:cp [08:43:30] FIRING: [2x] ProbeDown: Service wdqs2018:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2018:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:44:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T367856)', diff saved to https://phabricator.wikimedia.org/P65641 and previous config saved to /var/cache/conftool/dbconfig/20240702-084425-marostegui.json [08:44:27] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance [08:44:28] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [08:44:41] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance [08:44:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1197 (T367856)', diff saved to https://phabricator.wikimedia.org/P65642 and previous config saved to /var/cache/conftool/dbconfig/20240702-084447-marostegui.json [08:45:20] PROBLEM - 
mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:45:37] hashar: excellent :) thx [08:45:52] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:46:28] (03CR) 10Urbanecm: [C:04-1] "FWIW, this is not the requirement. It is perfectly fine to add settings to WMF config that are not yet in extension.json, especially when " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047961 (https://phabricator.wikimedia.org/T365877) (owner: 10Sergio Gimeno) [08:47:12] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52340 bytes in 1.868 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:47:30] (03CR) 10Elukey: Homer: fix Netbox 4 breaking changes (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/1050377 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [08:47:41] (03CR) 10Vgutierrez: "ok... don't forget to update the commit message" [puppet] - 10https://gerrit.wikimedia.org/r/1051198 (https://phabricator.wikimedia.org/T365718) (owner: 10Fabfur) [08:47:42] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:48:25] (03CR) 10Elukey: [C:03+1] "Didn't check all the details but if the code is tested and works, LGTM!" 
[software/homer] - 10https://gerrit.wikimedia.org/r/1050377 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [08:48:30] RESOLVED: [2x] ProbeDown: Service wdqs2018:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2018:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:48:34] (03CR) 10Vgutierrez: benthos:cache: encode problematic fields as hex (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1051198 (https://phabricator.wikimedia.org/T365718) (owner: 10Fabfur) [08:50:14] (03CR) 10Elukey: [C:03+1] "LGTM, the only nit that would be great is to add comments where we use [0] to indicate why. In the future all reviewers will be happy to a" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1050379 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [08:53:07] (03CR) 10Elukey: "Just to double check - all of this works with python3-pynetbox 6.6 right? Or do we need to test it somewhere?" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1050453 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [08:54:38] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 15), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9943487 (10mforns) > The service is up and running in staging, and can be reac... [08:54:47] (03PS2) 10Slyngshede: LDAP key sync: Improvements to SSH key sync with LDAP. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1051293 (https://phabricator.wikimedia.org/T366525) [08:57:34] !log jynus@cumin1002 dbctl commit (dc=all): 'Repool es1025 at 10% weight T363812', diff saved to https://phabricator.wikimedia.org/P65643 and previous config saved to /var/cache/conftool/dbconfig/20240702-085733-jynus.json [08:57:37] T363812: Setup backups for es6, es7 and archive old read only backups - https://phabricator.wikimedia.org/T363812 [08:59:10] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[39-41] - https://phabricator.wikimedia.org/T363341#9943498 (10cmooney) 05Resolved→03Open [09:00:32] (03CR) 10Vgutierrez: [C:04-1] "this sounds a lot like https://community.letsencrypt.org/t/apache-chain-issues-with-dual-rsa-ecdsa-certificates/153960. Please use `SSLCer" [puppet] - 10https://gerrit.wikimedia.org/r/1051249 (https://phabricator.wikimedia.org/T369008) (owner: 10Filippo Giunchedi) [09:02:44] (03CR) 10Elukey: "Quick question to better understand the code - it would be nice to avoid using the [0] selector throughout the code, since we know that we" [cookbooks] - 10https://gerrit.wikimedia.org/r/1050445 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [09:03:20] (03CR) 10Vgutierrez: [C:04-1] "BTW the behavior that you're describing is well-known an documented on the Apache httpd documentation in https://httpd.apache.org/docs/cur" [puppet] - 10https://gerrit.wikimedia.org/r/1051249 (https://phabricator.wikimedia.org/T369008) (owner: 10Filippo Giunchedi) [09:10:24] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[39-41] - https://phabricator.wikimedia.org/T363341#9943509 (10cmooney) >>! In T363341#9936269, @Jclark-ctr wrote: > cloudcephosd1039 > 2nd cable serial#20220008 port 1 > cloudcephosd1040 > 2nd cable serial#... 
[09:15:09] !log jynus@cumin1002 dbctl commit (dc=all): 'Repool es1025 at 50% weight T363812', diff saved to https://phabricator.wikimedia.org/P65644 and previous config saved to /var/cache/conftool/dbconfig/20240702-091508-jynus.json [09:15:12] T363812: Setup backups for es6, es7 and archive old read only backups - https://phabricator.wikimedia.org/T363812 [09:15:14] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[39-41] - https://phabricator.wikimedia.org/T363341#9943513 (10cmooney) 05Open→03Resolved [09:17:36] (03CR) 10Volans: "I did a quick pass Arnold. The general approach looks good, nothing major. I've left few minor suggestions inline." [cookbooks] - 10https://gerrit.wikimedia.org/r/1042179 (https://phabricator.wikimedia.org/T366078) (owner: 10AOkoth) [09:20:13] !log brouberol@cumin1002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:dse-k8s-worker [09:22:03] (03PS1) 10Volans: Images: take advantage of performance optimization [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1051298 [09:22:03] (03PS1) 10Volans: Images: automatically migrate to a new OS [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1051299 [09:23:06] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Add per-output queue monitoring for Juniper network devices - https://phabricator.wikimedia.org/T326322#9943538 (10cmooney) So the change to the timeout has made a big difference, but there are still some small gaps: {F56165130} {F5616524... 
[09:23:43] (03CR) 10Jelto: [C:03+2] gitlab-settings: v1.6.0 for squash commit templates [puppet] - 10https://gerrit.wikimedia.org/r/1051178 (https://phabricator.wikimedia.org/T366624) (owner: 10Brennen Bearnes) [09:24:44] (03PS1) 10Cathal Mooney: Increase scrape_timeout for gnmic prometheus to 30s [puppet] - 10https://gerrit.wikimedia.org/r/1051300 (https://phabricator.wikimedia.org/T326322) [09:26:25] RESOLVED: SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:26:47] (03PS2) 10Filippo Giunchedi: librenms: serve chained LE certs [puppet] - 10https://gerrit.wikimedia.org/r/1051249 (https://phabricator.wikimedia.org/T369008) [09:28:45] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 10Puppet-Infrastructure, and 2 others: Migrate puppet merges to a cookbook - https://phabricator.wikimedia.org/T366355#9943566 (10elukey) [09:29:21] (03CR) 10Vgutierrez: [C:03+1] librenms: serve chained LE certs [puppet] - 10https://gerrit.wikimedia.org/r/1051249 (https://phabricator.wikimedia.org/T369008) (owner: 10Filippo Giunchedi) [09:29:49] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] librenms: serve chained LE certs [puppet] - 10https://gerrit.wikimedia.org/r/1051249 (https://phabricator.wikimedia.org/T369008) (owner: 10Filippo Giunchedi) [09:33:02] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 15), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9943578 (10Sfaci) Great explanation @Scott_French!. I didn't know that. We'll... [09:34:21] (03PS1) 10Filippo Giunchedi: o11y: serve LE chained certificates [puppet] - 10https://gerrit.wikimedia.org/r/1051301 (https://phabricator.wikimedia.org/T369014) [09:34:41] (03CR) 10Volans: "Nice! 
Couple of suggestions inline, but I agree with the approach." [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1049966 (https://phabricator.wikimedia.org/T368744) (owner: 10Elukey) [09:36:28] (03CR) 10Vgutierrez: [C:03+1] o11y: serve LE chained certificates [puppet] - 10https://gerrit.wikimedia.org/r/1051301 (https://phabricator.wikimedia.org/T369014) (owner: 10Filippo Giunchedi) [09:38:09] (03PS1) 10Vgutierrez: gerrit: Stop using SSLCertificateChainFile [puppet] - 10https://gerrit.wikimedia.org/r/1051304 (https://phabricator.wikimedia.org/T369014) [09:39:47] (03CR) 10Filippo Giunchedi: [C:03+2] o11y: serve LE chained certificates [puppet] - 10https://gerrit.wikimedia.org/r/1051301 (https://phabricator.wikimedia.org/T369014) (owner: 10Filippo Giunchedi) [09:40:40] (03PS1) 10Vgutierrez: mirrors: Stop using SSLCertificateChainFile [puppet] - 10https://gerrit.wikimedia.org/r/1051307 (https://phabricator.wikimedia.org/T369014) [09:41:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:41:28] (03CR) 10Hashar: [C:03+1] gerrit: Stop using SSLCertificateChainFile [puppet] - 10https://gerrit.wikimedia.org/r/1051304 (https://phabricator.wikimedia.org/T369014) (owner: 10Vgutierrez) [09:45:51] (03PS1) 10Vgutierrez: orchestrator: Stop using SSLCertificateChainFile [puppet] - 10https://gerrit.wikimedia.org/r/1051310 (https://phabricator.wikimedia.org/T369014) [09:46:33] marostegui: ^^ could you take care of getting that CR reviewed from somebody in your team? 
[09:46:41] s/from/by/ [09:47:02] (03CR) 10Filippo Giunchedi: [C:03+1] Increase scrape_timeout for gnmic prometheus to 30s [puppet] - 10https://gerrit.wikimedia.org/r/1051300 (https://phabricator.wikimedia.org/T326322) (owner: 10Cathal Mooney) [09:47:45] (03CR) 10EoghanGaffney: [C:03+1] "Looks good!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1051107 (https://phabricator.wikimedia.org/T367501) (owner: 10Jelto) [09:47:53] !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on kubemaster[2001-2002].codfw.wmnet with reason: decom [09:48:05] (03PS1) 10Jelto: gitlab: use ensure latest when cloning the gitlab-exporter repo [puppet] - 10https://gerrit.wikimedia.org/r/1051312 (https://phabricator.wikimedia.org/T354656) [09:48:08] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on kubemaster[2001-2002].codfw.wmnet with reason: decom [09:48:33] (03CR) 10Hashar: [C:03+1] "I don't know how sensible this change is since I don't know much about certificates." 
[puppet] - 10https://gerrit.wikimedia.org/r/1051304 (https://phabricator.wikimedia.org/T369014) (owner: 10Vgutierrez) [09:50:13] vgutierrez: checking [09:50:27] marostegui: thx <3 [09:50:51] (03CR) 10Marostegui: [C:03+1] orchestrator: Stop using SSLCertificateChainFile [puppet] - 10https://gerrit.wikimedia.org/r/1051310 (https://phabricator.wikimedia.org/T369014) (owner: 10Vgutierrez) [09:52:21] !log volatile dir on puppetserver1001 with the new point release (12.6) for Bookworm [09:52:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:39] (03CR) 10Marostegui: [C:03+1] "Once merged let me know, so I can double check that everything works as expected" [puppet] - 10https://gerrit.wikimedia.org/r/1051310 (https://phabricator.wikimedia.org/T369014) (owner: 10Vgutierrez) [09:53:53] !log jiji@cumin1002 conftool action : set/pooled=no; selector: name=kubemaster200[1-2].codfw.wmnet [09:56:44] (03CR) 10Clément Goubert: [C:03+1] services: update thumbor-plugin Docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051150 (owner: 10Elukey) [09:58:55] (03PS2) 10Volans: Images: automatically migrate to a new OS [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1051299 (https://phabricator.wikimedia.org/T368744) [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240702T1000) [10:00:39] (03CR) 10EoghanGaffney: [C:03+1] gitlab: use ensure latest when cloning the gitlab-exporter repo [puppet] - 10https://gerrit.wikimedia.org/r/1051312 (https://phabricator.wikimedia.org/T354656) (owner: 10Jelto) [10:01:40] RECOVERY - BFD status on cr1-drmrs is OK: UP: 0 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:02:06] !log homer 'cr*codfw*' commit 'T351074' [10:02:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:11] T351074: Move servers from the appserver/api cluster to 
kubernetes - https://phabricator.wikimedia.org/T351074 [10:03:41] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1051307 (https://phabricator.wikimedia.org/T369014) (owner: 10Vgutierrez) [10:04:13] (03CR) 10Vgutierrez: "the change at the moment should be a NOOP for gerrit. But if we don't deploy it as soon as acme-chief renews gerrit certificate (it should" [puppet] - 10https://gerrit.wikimedia.org/r/1051304 (https://phabricator.wikimedia.org/T369014) (owner: 10Vgutierrez) [10:06:36] !log jynus@cumin1002 dbctl commit (dc=all): 'Repool es1025 at 100% weight T363812', diff saved to https://phabricator.wikimedia.org/P65645 and previous config saved to /var/cache/conftool/dbconfig/20240702-100636-jynus.json [10:06:39] T363812: Setup backups for es6, es7 and archive old read only backups - https://phabricator.wikimedia.org/T363812 [10:18:34] (03PS1) 10Elukey: profile::puppetmaster::updatenetboot: update help and install targets [puppet] - 10https://gerrit.wikimedia.org/r/1051317 [10:21:02] (03PS2) 10Elukey: profile::puppetmaster::updatenetboot: update help and install targets [puppet] - 10https://gerrit.wikimedia.org/r/1051317 [10:21:13] (03CR) 10Vgutierrez: [C:03+1] hiera: upgrade haproxy to 2.8 on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1051292 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur) [10:21:18] (03CR) 10Cathal Mooney: [C:03+2] Increase scrape_timeout for gnmic prometheus to 30s [puppet] - 10https://gerrit.wikimedia.org/r/1051300 (https://phabricator.wikimedia.org/T326322) (owner: 10Cathal Mooney) [10:21:25] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:22:35] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1051317 (owner: 10Elukey) [10:23:03] (03CR) 10Elukey: profile::puppetmaster::updatenetboot: update help and install targets [puppet] - 10https://gerrit.wikimedia.org/r/1051317 (owner: 10Elukey) [10:25:40] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-master1003.eqiad.wmnet [10:26:17] (03CR) 10Fabfur: [C:03+2] hiera: upgrade haproxy to 2.8 on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1051292 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur) [10:27:35] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_eqiad [10:27:54] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_eqiad [10:28:30] !log upgrading A:cp-eqiad to haproxy 2.8.10 (T367756) [10:28:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:33] T367756: Upgrade hosts to haproxy 2.8.10 - https://phabricator.wikimedia.org/T367756 [10:32:55] !log brouberol@cumin1002 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:dse-k8s-worker [10:34:45] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-master1003.eqiad.wmnet [10:35:48] !log brouberol@cumin1002 START - Cookbook sre.druid.roll-restart-workers for Druid public cluster: Roll restart of Druid jvm daemons. 
[10:36:25] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:38:50] (03CR) 10Clément Goubert: [C:03+1] profile::puppetmaster::updatenetboot: update help and install targets [puppet] - 10https://gerrit.wikimedia.org/r/1051317 (owner: 10Elukey) [10:39:41] (03PS1) 10Effie Mouzeli: kubernetes: retire kubemaster200[1-2] [puppet] - 10https://gerrit.wikimedia.org/r/1051321 (https://phabricator.wikimedia.org/T353464) [10:41:46] (03PS3) 10Volans: data.yaml: Add daphnesmit to deployment [puppet] - 10https://gerrit.wikimedia.org/r/1049391 (https://phabricator.wikimedia.org/T368159) (owner: 10Slyngshede) [10:41:51] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [10:42:01] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:42:24] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [10:42:43] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:42:51] (03CR) 10Volans: [C:03+2] "Rebased resolving conflicts. Approved on task. Merging." [puppet] - 10https://gerrit.wikimedia.org/r/1049391 (https://phabricator.wikimedia.org/T368159) (owner: 10Slyngshede) [10:43:53] (03PS3) 10Elukey: Allow to save new OS names without them being present on the DB [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1049966 (https://phabricator.wikimedia.org/T368744) [10:44:49] (03PS2) 10Effie Mouzeli: kubernetes: retire kubemaster200[1-2] in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1051321 (https://phabricator.wikimedia.org/T353464) [10:45:22] (03CR) 10Elukey: "Thanks a lot for the review!" 
[software/debmonitor] - 10https://gerrit.wikimedia.org/r/1049966 (https://phabricator.wikimedia.org/T368744) (owner: 10Elukey) [10:46:01] (03CR) 10Clément Goubert: [C:03+1] kubernetes: retire kubemaster200[1-2] in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1051321 (https://phabricator.wikimedia.org/T353464) (owner: 10Effie Mouzeli) [10:46:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2165 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P65646 and previous config saved to /var/cache/conftool/dbconfig/20240702-104605-root.json [10:47:32] (03PS1) 10Effie Mouzeli: kubernetes: retire kubemaster100[1-2] in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1051323 (https://phabricator.wikimedia.org/T353464) [10:48:06] 06SRE, 06cloud-services-team, 10Data-Services: [wikireplicas] Make sure there is no sensitive data in clouddb hosts - https://phabricator.wikimedia.org/T368136#9943798 (10Marostegui) Just one addition: sanitarium hosts also have replication filters to exclude tables or entire databases (private wikis). [10:48:10] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for daphnesmit/Daphne Smit/DSmit-WMF - https://phabricator.wikimedia.org/T368159#9943785 (10Volans) 05Open→03Resolved The above patch has been merged. Within 30 minutes it will be effective. Resolving the task. Feel fre... [10:48:23] (03CR) 10Effie Mouzeli: [C:03+2] kubernetes: retire kubemaster200[1-2] in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1051321 (https://phabricator.wikimedia.org/T353464) (owner: 10Effie Mouzeli) [10:49:42] (03CR) 10Volans: [C:03+1] "LGTM!" 
[software/debmonitor] - https://gerrit.wikimedia.org/r/1049966 (https://phabricator.wikimedia.org/T368744) (owner: Elukey) [10:50:29] !log jiji@cumin1002 START - Cookbook sre.hosts.decommission for hosts kubemaster[2001-2002].codfw.wmnet [10:54:09] SRE, serviceops, Data Products (Data Products Sprint 15), Patch-For-Review, Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9943809 (Clement_Goubert) >>! In T361835#9943486, @mforns wrote: >> The serv... [10:56:25] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:56:51] !log jiji@cumin1002 START - Cookbook sre.dns.netbox [10:59:03] (CR) Ayounsi: "yeah exactly. The end goal is to have all Netbox API calls in a spicerack module, and avoid direct calls from cookbooks. 
For example with " [cookbooks] - https://gerrit.wikimedia.org/r/1050445 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi) [11:00:14] (CR) Ayounsi: "I haven't tested it, but I tested similar changes in Homer and that works on Pynetbox 6.6" [software/spicerack] - https://gerrit.wikimedia.org/r/1050453 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi) [11:00:47] (CR) Elukey: Allow to save new OS names without them being present on the DB (2 comments) [software/debmonitor] - https://gerrit.wikimedia.org/r/1049966 (https://phabricator.wikimedia.org/T368744) (owner: Elukey) [11:00:54] (CR) Elukey: Allow to save new OS names without them being present on the DB (1 comment) [software/debmonitor] - https://gerrit.wikimedia.org/r/1049966 (https://phabricator.wikimedia.org/T368744) (owner: Elukey) [11:01:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2165 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P65647 and previous config saved to /var/cache/conftool/dbconfig/20240702-110111-root.json [11:03:55] (PS5) Jcrespo: dbbackups: Set sql_mode as loose so that invalid enum values are ok [puppet] - https://gerrit.wikimedia.org/r/1048424 (https://phabricator.wikimedia.org/T367162) [11:03:55] (PS1) Jcrespo: backup: Reduce the maximum amount of volumes for es-rw pools [puppet] - https://gerrit.wikimedia.org/r/1051324 [11:04:10] (PS1) Gerrit maintenance bot: mariadb: Promote db1173 to s6 master [puppet] - https://gerrit.wikimedia.org/r/1051325 (https://phabricator.wikimedia.org/T369020) [11:04:15] (PS1) Gerrit maintenance bot: wmnet: Update s6-master alias [dns] - https://gerrit.wikimedia.org/r/1051326 (https://phabricator.wikimedia.org/T369020) [11:04:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T364069)', diff saved to https://phabricator.wikimedia.org/P65648 and previous config saved to 
/var/cache/conftool/dbconfig/20240702-110442-marostegui.json [11:04:46] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [11:06:43] (PS1) Gerrit maintenance bot: mariadb: Promote db2214 to s6 master [puppet] - https://gerrit.wikimedia.org/r/1051327 (https://phabricator.wikimedia.org/T369021) [11:07:35] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 27 hosts with reason: Primary switchover s6 T369021 [11:07:38] T369021: Switchover s6 master (db2129 -> db2214) - https://phabricator.wikimedia.org/T369021 [11:07:41] !log jiji@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kubemaster[2001-2002].codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1002" [11:07:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db2214 with weight 0 T369021', diff saved to https://phabricator.wikimedia.org/P65649 and previous config saved to /var/cache/conftool/dbconfig/20240702-110750-root.json [11:07:58] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s6 T369021 [11:08:37] (CR) Marostegui: [C:+2] mariadb: Promote db2214 to s6 master [puppet] - https://gerrit.wikimedia.org/r/1051327 (https://phabricator.wikimedia.org/T369021) (owner: Gerrit maintenance bot) [11:10:28] (CR) Jcrespo: [C:+2] dbbackups: Set sql_mode as loose so that invalid enum values are ok [puppet] - https://gerrit.wikimedia.org/r/1048424 (https://phabricator.wikimedia.org/T367162) (owner: Jcrespo) [11:10:28] (PS1) Clément Goubert: kubernetes: move 5 appservers to kubernetes [puppet] - https://gerrit.wikimedia.org/r/1051328 (https://phabricator.wikimedia.org/T351074) [11:10:53] (CR) Jcrespo: [C:+2] backup: Reduce the maximum amount of volumes for es-rw pools [puppet] - https://gerrit.wikimedia.org/r/1051324 (owner: Jcrespo) 
[11:11:25] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:11:25] !log jiji@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kubemaster[2001-2002].codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1002" [11:11:25] !log jiji@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:11:26] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts kubemaster[2001-2002].codfw.wmnet [11:11:46] (PS1) Elukey: cloud: add default for profile::puppetserver::git::exclude_servers [puppet] - https://gerrit.wikimedia.org/r/1051329 [11:12:18] (CR) David Caro: [C:+1] "thanks!" [puppet] - https://gerrit.wikimedia.org/r/1051329 (owner: Elukey) [11:12:24] (CR) Elukey: [C:+2] cloud: add default for profile::puppetserver::git::exclude_servers [puppet] - https://gerrit.wikimedia.org/r/1051329 (owner: Elukey) [11:12:29] !log pooling and uncordoning wikikube-worker2025.codfw.wmnet|wikikube-worker2026.codfw.wmnet|wikikube-worker2027.codfw.wmnet|wikikube-worker2028.codfw.wmnet|wikikube-worker2029.codfw.wmnet - T351074 [11:12:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:31] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [11:12:38] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_eqiad [11:12:39] !log cgoubert@cumin1002 conftool action : set/weight=10:pooled=yes; selector: 
name=(wikikube-worker2025.codfw.wmnet|wikikube-worker2026.codfw.wmnet|wikikube-worker2027.codfw.wmnet|wikikube-worker2028.codfw.wmnet|wikikube-worker2029.codfw.wmnet),cluster=kubernetes,service=kubesvc [11:14:25] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_eqiad [11:14:54] (CR) Marostegui: "Sorry I missed this!" [puppet] - https://gerrit.wikimedia.org/r/1048424 (https://phabricator.wikimedia.org/T367162) (owner: Jcrespo) [11:16:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2165 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P65650 and previous config saved to /var/cache/conftool/dbconfig/20240702-111616-root.json [11:16:27] (PS2) Sergio Gimeno: GrowthExperiments: add community updates module flag [mediawiki-config] - https://gerrit.wikimedia.org/r/1047961 (https://phabricator.wikimedia.org/T365877) [11:16:46] (PS1) Jcrespo: backup: Increase the maximum amount of volumes for es-readonly [puppet] - https://gerrit.wikimedia.org/r/1051330 [11:17:06] (CR) Jcrespo: [C:-1] backup: Increase the maximum amount of volumes for es-readonly [puppet] - https://gerrit.wikimedia.org/r/1051330 (owner: Jcrespo) [11:17:41] (PS2) Jcrespo: backup: Increase the maximum amount of volumes for es-readonly [puppet] - https://gerrit.wikimedia.org/r/1051330 [11:17:53] !log root@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1008.eqiad.wmnet with OS bullseye [11:17:53] (PS3) Jcrespo: backup: Increase the maximum amount of volumes for es-readonly [puppet] - https://gerrit.wikimedia.org/r/1051330 [11:19:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P65651 and previous config saved to /var/cache/conftool/dbconfig/20240702-111949-marostegui.json [11:20:45] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 
'apply'. [11:21:05] !log Uncordoning wikikube-ctrl2001.codfw.wmnet and wikikube-ctrl2002.codfw.wmnet [11:21:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:25] FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:21:50] (CR) Sergio Gimeno: "Right, ty. I was not sure if going with a default of true or false at the time of writing this patch. Derived from the fact of deciding if" [mediawiki-config] - https://gerrit.wikimedia.org/r/1047961 (https://phabricator.wikimedia.org/T365877) (owner: Sergio Gimeno) [11:21:54] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [11:21:57] !log jayme@cumin1002 START - Cookbook sre.hosts.reboot-single for host kubernetes1051.eqiad.wmnet [11:22:02] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_eqiad [11:22:05] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_eqiad [11:22:29] (PS8) David Caro: envvars-backend: update endpoint to new schema [puppet] - https://gerrit.wikimedia.org/r/1050567 (https://phabricator.wikimedia.org/T368516) (owner: Slavina Stefanova) [11:22:29] (PS1) David Caro: ci: enable failing when hiera missing from cloud [puppet] - https://gerrit.wikimedia.org/r/1051332 [11:22:32] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [11:22:42] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host eventlog1003.eqiad.wmnet [11:23:10] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[11:24:42] !log switched wikikube production clusters from PSP to PSS for restricted namespaces - T273507 [11:24:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:45] T273507: PodSecurityPolicies will be deprecated with Kubernetes 1.21 - https://phabricator.wikimedia.org/T273507 [11:24:56] !log Starting s6 codfw failover from db2129 to db2214 - T369021 [11:24:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:59] T369021: Switchover s6 master (db2129 -> db2214) - https://phabricator.wikimedia.org/T369021 [11:25:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2214 to s6 primary T369021', diff saved to https://phabricator.wikimedia.org/P65652 and previous config saved to /var/cache/conftool/dbconfig/20240702-112518-marostegui.json [11:26:04] (CR) CI reject: [V:-1] envvars-backend: update endpoint to new schema [puppet] - https://gerrit.wikimedia.org/r/1050567 (https://phabricator.wikimedia.org/T368516) (owner: Slavina Stefanova) [11:26:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2129 T369021', diff saved to https://phabricator.wikimedia.org/P65653 and previous config saved to /var/cache/conftool/dbconfig/20240702-112616-root.json [11:26:25] FIRING: [5x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:26:27] PROBLEM - BGP status on lsw1-e3-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:26:28] (CR) David Caro: "this was removed from voting in I41fe8738c4d15beecb70753ed7dd76fcea85405a" [puppet] - https://gerrit.wikimedia.org/r/1051332 (owner: David Caro) [11:26:36] !log btullis@cumin1002 
END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host eventlog1003.eqiad.wmnet [11:27:13] !log brouberol@cumin1002 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid public cluster: Roll restart of Druid jvm daemons. [11:27:58] (PS9) David Caro: envvars-backend: update endpoint to new schema [puppet] - https://gerrit.wikimedia.org/r/1050567 (https://phabricator.wikimedia.org/T368516) (owner: Slavina Stefanova) [11:31:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2165 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P65654 and previous config saved to /var/cache/conftool/dbconfig/20240702-113122-root.json [11:31:25] (CR) CI reject: [V:-1] envvars-backend: update endpoint to new schema [puppet] - https://gerrit.wikimedia.org/r/1050567 (https://phabricator.wikimedia.org/T368516) (owner: Slavina Stefanova) [11:34:05] (CR) Jcrespo: [C:+2] backup: Increase the maximum amount of volumes for es-readonly [puppet] - https://gerrit.wikimedia.org/r/1051330 (owner: Jcrespo) [11:34:43] (PS1) Marostegui: db2114: No longer a candidate master [puppet] - https://gerrit.wikimedia.org/r/1051333 (https://phabricator.wikimedia.org/T362948) [11:34:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P65655 and previous config saved to /var/cache/conftool/dbconfig/20240702-113457-marostegui.json [11:35:11] (CR) Marostegui: [C:+2] db2114: No longer a candidate master [puppet] - https://gerrit.wikimedia.org/r/1051333 (https://phabricator.wikimedia.org/T362948) (owner: Marostegui) [11:36:02] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Long schema change [11:36:05] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Long schema 
change [11:37:33] !log brouberol@cumin1002 START - Cookbook sre.druid.roll-restart-workers for Druid analytics cluster: Roll restart of Druid jvm daemons. [11:37:55] (CR) Jelto: [C:+2] gitlab: use ensure latest when cloning the gitlab-exporter repo [puppet] - https://gerrit.wikimedia.org/r/1051312 (https://phabricator.wikimedia.org/T354656) (owner: Jelto) [11:40:44] (CR) Alexandros Kosiaris: [C:+1] kubernetes: move 5 appservers to kubernetes [puppet] - https://gerrit.wikimedia.org/r/1051328 (https://phabricator.wikimedia.org/T351074) (owner: Clément Goubert) [11:41:12] (CR) Clément Goubert: [C:+2] kubernetes: move 5 appservers to kubernetes [puppet] - https://gerrit.wikimedia.org/r/1051328 (https://phabricator.wikimedia.org/T351074) (owner: Clément Goubert) [11:41:25] FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:42:32] PROBLEM - Host kubernetes1051 is DOWN: PING CRITICAL - Packet loss = 100% [11:42:33] FIRING: KubernetesCalicoDown: kubernetes1051.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=kubernetes1051.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [11:43:05] !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw2307 to wikikube-worker2030 [11:43:11] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [11:43:42] (PS1) Jelto: gitlab: use dedicated ensure for gitlab-exporter [puppet] - https://gerrit.wikimedia.org/r/1051335 (https://phabricator.wikimedia.org/T354656) [11:44:35] !log root@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1008.eqiad.wmnet with OS bullseye 
[11:46:25] FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:46:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2165 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P65656 and previous config saved to /var/cache/conftool/dbconfig/20240702-114627-root.json [11:46:31] (CR) Jforrester: [C:+2] Reference widget: check for undefined config [extensions/WikibaseMediaInfo] (wmf/1.43.0-wmf.11) - https://gerrit.wikimedia.org/r/1051202 (https://phabricator.wikimedia.org/T368736) (owner: Jforrester) [11:48:26] !log root@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1008.eqiad.wmnet with OS bullseye [11:50:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T364069)', diff saved to https://phabricator.wikimedia.org/P65657 and previous config saved to /var/cache/conftool/dbconfig/20240702-115003-marostegui.json [11:50:07] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2174.codfw.wmnet with reason: Maintenance [11:50:07] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [11:50:20] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2174.codfw.wmnet with reason: Maintenance [11:50:25] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2307 to wikikube-worker2030 - cgoubert@cumin1002" [11:50:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2174 (T364069)', diff saved to https://phabricator.wikimedia.org/P65658 and previous config saved to /var/cache/conftool/dbconfig/20240702-115026-marostegui.json [11:52:53] !log 
cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2307 to wikikube-worker2030 - cgoubert@cumin1002" [11:52:54] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:52:54] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2030 [11:53:29] (PS1) Jcrespo: dbbackups: Reenable es-readonly for one-time es5 section backup [puppet] - https://gerrit.wikimedia.org/r/1051341 (https://phabricator.wikimedia.org/T363812) [11:54:03] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2030 [11:54:12] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2307 to wikikube-worker2030 [11:55:03] !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw2309 to wikikube-worker2031 [11:55:09] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [11:57:03] (CR) CI reject: [V:-1] dbbackups: Reenable es-readonly for one-time es5 section backup [puppet] - https://gerrit.wikimedia.org/r/1051341 (https://phabricator.wikimedia.org/T363812) (owner: Jcrespo) [11:57:29] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2309 to wikikube-worker2031 - cgoubert@cumin1002" [11:58:16] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [11:58:29] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [11:58:41] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2309 to wikikube-worker2031 - cgoubert@cumin1002" [11:58:41] !log cgoubert@cumin1002 END (PASS) - Cookbook 
sre.dns.netbox (exit_code=0) [11:58:41] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2031 [11:58:53] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2031 [11:58:54] (CR) EoghanGaffney: [C:+1] gitlab: use dedicated ensure for gitlab-exporter [puppet] - https://gerrit.wikimedia.org/r/1051335 (https://phabricator.wikimedia.org/T354656) (owner: Jelto) [11:59:01] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2309 to wikikube-worker2031 [11:59:07] (PS1) Ayounsi: Preseed: set /32 netmask for virtual ranges [puppet] - https://gerrit.wikimedia.org/r/1051342 (https://phabricator.wikimedia.org/T300152) [11:59:24] (CR) Jcrespo: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1051341 (https://phabricator.wikimedia.org/T363812) (owner: Jcrespo) [11:59:25] !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw2365 to wikikube-worker2032 [11:59:32] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [11:59:37] (PS1) Jforrester: Drop bare-metal servers from Wikimedia Debug tool config [mediawiki-config] - https://gerrit.wikimedia.org/r/1051343 (https://phabricator.wikimedia.org/T367949) [11:59:41] (PS1) Jforrester: mwdebug: Change various uses to mw-on-k8s version [puppet] - https://gerrit.wikimedia.org/r/1051344 [11:59:42] (PS1) Jforrester: mwdebug: Drop mwdebug\d{4} for bare metal servers [puppet] - https://gerrit.wikimedia.org/r/1051345 (https://phabricator.wikimedia.org/T367949) [11:59:48] (CR) Jelto: [C:+2] gitlab: use dedicated ensure for gitlab-exporter [puppet] - https://gerrit.wikimedia.org/r/1051335 (https://phabricator.wikimedia.org/T354656) (owner: Jelto) [12:00:01] I got CI error on profile::gitlab, any recent change there? 
[12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240702T1200) [12:00:24] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [12:00:25] jynus: fix is merging. should be fixed in a sec / after rebase [12:00:25] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [12:00:32] jelto: no worries then [12:00:47] (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1051342 (https://phabricator.wikimedia.org/T300152) (owner: Ayounsi) [12:00:49] I was just confused because my patch was so trivial! [12:00:59] marostegui: merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1051310 now [12:01:07] (CR) Vgutierrez: [C:+2] orchestrator: Stop using SSLCertificateChainFile [puppet] - https://gerrit.wikimedia.org/r/1051310 (https://phabricator.wikimedia.org/T369014) (owner: Vgutierrez) [12:01:29] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [12:01:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2165 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P65659 and previous config saved to /var/cache/conftool/dbconfig/20240702-120133-root.json [12:01:34] Ok vgutierrez [12:01:39] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [12:01:59] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2365 to wikikube-worker2032 - cgoubert@cumin1002" [12:02:25] puppet/CI/pcc for profile::gitlab should be happy again [12:02:30] (CR) CI reject: [V:-1] Preseed: set /32 netmask for virtual ranges [puppet] - https://gerrit.wikimedia.org/r/1051342 (https://phabricator.wikimedia.org/T300152) (owner: Ayounsi) [12:03:15] !log cgoubert@cumin1002 END 
(PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2365 to wikikube-worker2032 - cgoubert@cumin1002" [12:03:15] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:03:15] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2032 [12:03:17] (CR) Vgutierrez: [C:+2] mirrors: Stop using SSLCertificateChainFile [puppet] - https://gerrit.wikimedia.org/r/1051307 (https://phabricator.wikimedia.org/T369014) (owner: Vgutierrez) [12:04:21] (CR) CI reject: [V:-1] mwdebug: Change various uses to mw-on-k8s version [puppet] - https://gerrit.wikimedia.org/r/1051344 (owner: Jforrester) [12:04:30] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2032 [12:04:39] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2365 to wikikube-worker2032 [12:05:06] (PS2) Jforrester: [wikifunctions] Grant wikifunctions-staff enum and converter rights [mediawiki-config] - https://gerrit.wikimedia.org/r/1048059 (https://phabricator.wikimedia.org/T366610) [12:05:08] !log root@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1008.eqiad.wmnet with reason: host reimage [12:05:22] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, July 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1048059 (https://phabricator.wikimedia.org/T366610) (owner: Jforrester) [12:05:32] (CR) Jcrespo: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1051341 (https://phabricator.wikimedia.org/T363812) (owner: Jcrespo) [12:05:33] !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw2392 to wikikube-worker2033 [12:05:40] !log cgoubert@cumin1002 START - Cookbook 
sre.dns.netbox [12:05:50] (CR) CI reject: [V:-1] mwdebug: Drop mwdebug\d{4} for bare metal servers [puppet] - https://gerrit.wikimedia.org/r/1051345 (https://phabricator.wikimedia.org/T367949) (owner: Jforrester) [12:07:25] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_eqiad [12:07:41] !log root@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1008.eqiad.wmnet with reason: host reimage [12:08:08] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2392 to wikikube-worker2033 - cgoubert@cumin1002" [12:08:50] (CR) Marostegui: [C:+1] dbbackups: Reenable es-readonly for one-time es5 section backup [puppet] - https://gerrit.wikimedia.org/r/1051341 (https://phabricator.wikimedia.org/T363812) (owner: Jcrespo) [12:09:13] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_eqiad [12:09:25] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2392 to wikikube-worker2033 - cgoubert@cumin1002" [12:09:25] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:09:25] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2033 [12:09:27] (PS10) David Caro: envvars-backend: update endpoint to new schema [puppet] - https://gerrit.wikimedia.org/r/1050567 (https://phabricator.wikimedia.org/T368516) (owner: Slavina Stefanova) [12:09:36] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2033 [12:09:45] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2392 to 
wikikube-worker2033 [12:10:49] (CR) Jcrespo: [C:+2] dbbackups: Reenable es-readonly for one-time es5 section backup [puppet] - https://gerrit.wikimedia.org/r/1051341 (https://phabricator.wikimedia.org/T363812) (owner: Jcrespo) [12:11:12] !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw2393 to wikikube-worker2034 [12:11:17] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [12:12:22] (PS2) Ayounsi: Preseed: set /32 netmask for virtual ranges [puppet] - https://gerrit.wikimedia.org/r/1051342 (https://phabricator.wikimedia.org/T300152) [12:12:47] (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1051342 (https://phabricator.wikimedia.org/T300152) (owner: Ayounsi) [12:14:13] (Merged) jenkins-bot: Reference widget: check for undefined config [extensions/WikibaseMediaInfo] (wmf/1.43.0-wmf.11) - https://gerrit.wikimedia.org/r/1051202 (https://phabricator.wikimedia.org/T368736) (owner: Jforrester) [12:14:54] !log jforrester@deploy1002 Started scap sync-world: Backport for [[gerrit:1051202|Reference widget: check for undefined config (T368736)]] [12:14:57] T368736: Structured Data add reference not working - https://phabricator.wikimedia.org/T368736 [12:15:29] (CR) David Caro: "Ready for reviews, passes the tests and passes in toolsbeta:" [puppet] - https://gerrit.wikimedia.org/r/1050567 (https://phabricator.wikimedia.org/T368516) (owner: Slavina Stefanova) [12:15:39] SRE, Data-Engineering, Dumps-Generation, Data Products (Data Products Sprint 15), and 2 others: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#9944101 (ABran-WMF) [[ https://wm-bot.wmflabs.org/libera_logs/%23wikimedia-d... 
[12:15:51] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2393 to wikikube-worker2034 - cgoubert@cumin1002" [12:15:54] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [12:16:01] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:16:15] !log eoghan@cumin1002 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on lists1001.wikimedia.org with reason: Pre-decommissioning lists1001 [12:16:18] !log eoghan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on lists1001.wikimedia.org with reason: Pre-decommissioning lists1001 [12:16:34] SRE, collaboration-services, Wikimedia-Mailing-lists, Patch-For-Review: Migrate Mailman/lists to Bullseye/Bookworm - https://phabricator.wikimedia.org/T331706#9944103 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=410ac7b2-3327-4734-8665-8ceb56bdc810) set by eoghan@cumin1002 fo... 
[12:16:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2165 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P65660 and previous config saved to /var/cache/conftool/dbconfig/20240702-121638-root.json [12:17:13] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2393 to wikikube-worker2034 - cgoubert@cumin1002" [12:17:13] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:17:13] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2034 [12:17:24] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2034 [12:17:31] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2393 to wikikube-worker2034 [12:17:52] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2030.codfw.wmnet with OS bullseye [12:17:54] (CR) David Caro: [C:+2] envvars-backend: update endpoint to new schema [puppet] - https://gerrit.wikimedia.org/r/1050567 (https://phabricator.wikimedia.org/T368516) (owner: Slavina Stefanova) [12:18:14] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2031.codfw.wmnet with OS bullseye [12:18:34] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2032.codfw.wmnet with OS bullseye [12:18:50] (CR) Ayounsi: "D-I bug fixed and deployed in bookworm-installer - https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1064005" [puppet] - https://gerrit.wikimedia.org/r/1051342 (https://phabricator.wikimedia.org/T300152) (owner: Ayounsi) [12:18:54] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2033.codfw.wmnet with OS bullseye [12:19:12] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host 
wikikube-worker2034.codfw.wmnet with OS bullseye [12:19:17] !log jforrester@deploy1002 jforrester: Backport for [[gerrit:1051202|Reference widget: check for undefined config (T368736)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:19:41] !log jforrester@deploy1002 jforrester: Continuing with sync [12:19:49] (03CR) 10David Caro: [C:03+2] "Passing in tools too:" [puppet] - 10https://gerrit.wikimedia.org/r/1050567 (https://phabricator.wikimedia.org/T368516) (owner: 10Slavina Stefanova) [12:19:56] (03PS1) 10Fabfur: hiera: upgrade haproxy to 2.8 on esams [puppet] - 10https://gerrit.wikimedia.org/r/1051348 (https://phabricator.wikimedia.org/T367756) [12:20:50] PROBLEM - Host mw2307 is DOWN: PING CRITICAL - Packet loss = 100% [12:21:20] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review: Migrate Mailman/lists to Bullseye/Bookworm - https://phabricator.wikimedia.org/T331706#9944135 (10eoghan) lists1001 has been powered off, it will stay off for 1 week and then I'll decommission it fully on Tuesday, 9th July, aft... [12:22:14] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 07Datacenter-Switchover: Make mailman3 work in the standby host (lists2001.wikimedia.org) - https://phabricator.wikimedia.org/T283615#9944140 (10eoghan) 05In progress→03Resolved I think we can close this, since the puppet module now instal... 
[12:22:57] (03CR) 10Brouberol: [C:03+1] "PCC looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1051342 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [12:24:53] !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:1051202|Reference widget: check for undefined config (T368736)]] (duration: 09m 59s) [12:24:56] T368736: Structured Data add reference not working - https://phabricator.wikimedia.org/T368736 [12:25:02] !log brouberol@cumin1002 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid analytics cluster: Roll restart of Druid jvm daemons. [12:25:45] !log Deploy schema change on db2129 s6 codfw dbmaint T367856 [12:25:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:47] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [12:25:51] !log brouberol@cumin1002 START - Cookbook sre.druid.reboot-workers for Druid public cluster: Reboot Druid nodes [12:25:52] RECOVERY - Host mw2307 is UP: PING OK - Packet loss = 0%, RTA = 30.32 ms [12:28:00] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:28:26] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:28:44] (03CR) 10Vgutierrez: [C:03+2] gerrit: Stop using SSLCertificateChainFile [puppet] - 10https://gerrit.wikimedia.org/r/1051304 (https://phabricator.wikimedia.org/T369014) (owner: 10Vgutierrez) [12:30:20] (03CR) 10Ayounsi: [C:03+2] Preseed: set /32 netmask for virtual ranges [puppet] - 10https://gerrit.wikimedia.org/r/1051342 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [12:30:51] PROBLEM - Host mw2307 is DOWN: PING CRITICAL - Packet loss = 100% [12:31:19] RECOVERY - Host mw2307 is UP: PING OK - Packet loss = 0%, RTA = 30.31 ms [12:31:27] PROBLEM - mailman list info ssl expiry 
on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:33:59] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2030.codfw.wmnet with reason: host reimage [12:34:06] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2031.codfw.wmnet with reason: host reimage [12:34:12] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2034.codfw.wmnet with reason: host reimage [12:34:16] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2033.codfw.wmnet with reason: host reimage [12:34:18] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2032.codfw.wmnet with reason: host reimage [12:34:41] (03PS1) 10Arturo Borrero Gonzalez: toolforge: prometheus: scrape kyverno metrics [puppet] - 10https://gerrit.wikimedia.org/r/1051351 (https://phabricator.wikimedia.org/T368515) [12:35:32] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1051351 (https://phabricator.wikimedia.org/T368515) (owner: 10Arturo Borrero Gonzalez) [12:36:25] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2030.codfw.wmnet with reason: host reimage [12:36:54] (03Abandoned) 10RhinosF1: remove s10 references [software/conftool] - 10https://gerrit.wikimedia.org/r/708632 (https://phabricator.wikimedia.org/T167973) (owner: 10RhinosF1) [12:37:22] (03Abandoned) 10RhinosF1: test [puppet] - 10https://gerrit.wikimedia.org/r/980470 (owner: 10RhinosF1) [12:39:52] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2031.codfw.wmnet with reason: host reimage [12:40:21] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] toolforge: prometheus: scrape kyverno metrics [puppet] - 
10https://gerrit.wikimedia.org/r/1051351 (https://phabricator.wikimedia.org/T368515) (owner: 10Arturo Borrero Gonzalez) [12:40:25] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 13 Aug 2024 12:55:14 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:40:51] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.182 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:41:25] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52338 bytes in 0.084 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:41:45] 10ops-eqiad, 06DC-Ops, 10Prod-Kubernetes, 06serviceops: kubernetes1051.eqiad.wmnet failed to pull mediawiki images - https://phabricator.wikimedia.org/T369011#9944322 (10JMeybohm) [12:41:51] !log jayme@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host kubernetes1051.eqiad.wmnet [12:41:53] (03PS1) 10Ayounsi: Routed ganeti: remove /23 -> /32 workaround [puppet] - 10https://gerrit.wikimedia.org/r/1051352 [12:42:00] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2034.codfw.wmnet with reason: host reimage [12:43:12] (03CR) 10Ayounsi: [C:03+2] Routed ganeti: remove /23 -> /32 workaround [puppet] - 10https://gerrit.wikimedia.org/r/1051352 (owner: 10Ayounsi) [12:44:16] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:44:20] !log decom eqiad old kubemasters - T353464 [12:44:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:26] T353464: Migrate wikikube control planes to hardware nodes - https://phabricator.wikimedia.org/T353464 
[12:45:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T367856)', diff saved to https://phabricator.wikimedia.org/P65661 and previous config saved to /var/cache/conftool/dbconfig/20240702-124517-marostegui.json [12:45:25] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [12:45:39] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2033.codfw.wmnet with reason: host reimage [12:46:04] !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on kubemaster[1001-1002].eqiad.wmnet with reason: decom [12:46:18] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on kubemaster[1001-1002].eqiad.wmnet with reason: decom [12:49:09] (03CR) 10Bking: dse-k8s-services: Add net-new chart for Airflow (038 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking) [12:49:32] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2032.codfw.wmnet with reason: host reimage [12:49:50] (03CR) 10Kamila Součková: [C:03+1] kubernetes: retire kubemaster100[1-2] in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1051323 (https://phabricator.wikimedia.org/T353464) (owner: 10Effie Mouzeli) [12:49:56] !log jiji@cumin1002 conftool action : set/pooled=no; selector: name=kubemaster100[1-2].eqiad.wmnet [12:50:53] PROBLEM - Host mw2307 is DOWN: PING CRITICAL - Packet loss = 100% [12:53:23] RECOVERY - Host mw2307 is UP: PING OK - Packet loss = 0%, RTA = 30.38 ms [12:53:33] PROBLEM - Host mw2309 is DOWN: PING CRITICAL - Packet loss = 100% [12:54:14] (03CR) 10Urbanecm: [C:03+1] GrowthExperiments: add community updates module flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047961 (https://phabricator.wikimedia.org/T365877) (owner: 10Sergio Gimeno) [12:55:14] !log 
jiji@cumin1002 conftool action : set/pooled=inactive; selector: name=kubemaster100[1-2].eqiad.wmnet [12:56:03] RECOVERY - Host mw2309 is UP: PING OK - Packet loss = 0%, RTA = 30.42 ms [12:56:37] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2030.codfw.wmnet with OS bullseye [12:57:00] (03CR) 10Lucas Werkmeister (WMDE): Enable EntitySchema data type on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042209 (https://phabricator.wikimedia.org/T332157) (owner: 10Lucas Werkmeister (WMDE)) [12:57:39] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 146 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [12:59:22] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2031.codfw.wmnet with OS bullseye [12:59:42] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2034.codfw.wmnet with OS bullseye [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: That opportune time for a UTC afternoon backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240702T1300). [13:00:05] Lucas_WMDE and James_F: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:07] o/ [13:00:09] I can deploy! 
[13:00:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P65662 and previous config saved to /var/cache/conftool/dbconfig/20240702-130024-marostegui.json [13:00:55] (03PS3) 10Lucas Werkmeister (WMDE): Enable EntitySchema data type on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042209 (https://phabricator.wikimedia.org/T332157) [13:01:25] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:01:34] Lucas_WMDE: you beat me to it. FWIW, I added a last-minute addition. [13:01:49] nothing to test on that, feel free to ship it with something else if needed. [13:02:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042209 (https://phabricator.wikimedia.org/T332157) (owner: 10Lucas Werkmeister (WMDE)) [13:02:43] urbanecm: alright, looking [13:03:12] (03Merged) 10jenkins-bot: Enable EntitySchema data type on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042209 (https://phabricator.wikimedia.org/T332157) (owner: 10Lucas Werkmeister (WMDE)) [13:03:18] okay, looks fine, I’ll deploy that together with the wikifunctions change then [13:03:22] unless James_F wants to do that one [13:03:34] Nah, I'll stand back and let you sling them out together. 
[13:03:38] ok ^^ [13:03:41] (03CR) 10Elukey: [C:03+2] Allow to save new OS names without them being present on the DB [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1049966 (https://phabricator.wikimedia.org/T368744) (owner: 10Elukey) [13:03:41] !log lucaswerkmeister-wmde@deploy1002 Started scap sync-world: Backport for [[gerrit:1042209|Enable EntitySchema data type on Wikidata (T332157)]] [13:03:44] T332157: [ES-M2]: Enable new EntitySchema data type on Wikidata - https://phabricator.wikimedia.org/T332157 [13:03:52] (03CR) 10Arnaudb: [C:03+1] mariadb: enable pt-heartbeat monitoring through mysqld-exporter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1048006 (https://phabricator.wikimedia.org/T367278) (owner: 10Arnaudb) [13:04:35] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2033.codfw.wmnet with OS bullseye [13:05:28] (03Merged) 10jenkins-bot: Allow to save new OS names without them being present on the DB [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1049966 (https://phabricator.wikimedia.org/T368744) (owner: 10Elukey) [13:06:18] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde: Backport for [[gerrit:1042209|Enable EntitySchema data type on Wikidata (T332157)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:07:31] (03PS2) 10Volans: Images: take advantage of performance optimization [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1051298 [13:08:14] !log elukey@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. [13:08:25] !log elukey@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. 
[13:08:28] (03CR) 10TChin: EventStreamConfig: Add hive ingestion defaults (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050596 (https://phabricator.wikimedia.org/T367134) (owner: 10TChin) [13:09:00] testing… [13:09:16] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde: Continuing with sync [13:09:21] !log jiji@cumin1002 START - Cookbook sre.hosts.decommission for hosts kubemaster[1001-1002].eqiad.wmnet [13:09:40] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2032.codfw.wmnet with OS bullseye [13:11:25] FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:13:38] (03PS3) 10Volans: Images: automatically migrate to a new OS [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1051299 (https://phabricator.wikimedia.org/T368744) [13:14:36] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1042209|Enable EntitySchema data type on Wikidata (T332157)]] (duration: 10m 54s) [13:14:38] T332157: [ES-M2]: Enable new EntitySchema data type on Wikidata - https://phabricator.wikimedia.org/T332157 [13:15:18] (03CR) 10Filippo Giunchedi: mariadb: enable pt-heartbeat monitoring through mysqld-exporter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1048006 (https://phabricator.wikimedia.org/T367278) (owner: 10Arnaudb) [13:15:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P65663 and previous config saved to /var/cache/conftool/dbconfig/20240702-131531-marostegui.json [13:16:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1048059 
(https://phabricator.wikimedia.org/T366610) (owner: 10Jforrester) [13:16:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047961 (https://phabricator.wikimedia.org/T365877) (owner: 10Sergio Gimeno) [13:16:25] FIRING: [5x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:16:26] !log jiji@cumin1002 START - Cookbook sre.dns.netbox [13:16:59] (03Merged) 10jenkins-bot: [wikifunctions] Grant wikifunctions-staff enum and converter rights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1048059 (https://phabricator.wikimedia.org/T366610) (owner: 10Jforrester) [13:17:02] (03Merged) 10jenkins-bot: GrowthExperiments: add community updates module flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047961 (https://phabricator.wikimedia.org/T365877) (owner: 10Sergio Gimeno) [13:17:31] !log lucaswerkmeister-wmde@deploy1002 Started scap sync-world: Backport for [[gerrit:1048059|[wikifunctions] Grant wikifunctions-staff enum and converter rights (T366610 T367270)]], [[gerrit:1047961|GrowthExperiments: add community updates module flag (T365877)]] [13:17:36] T366610: Restrict creation of instances of Types with identity keys to wikilambda-create-enum-value - https://phabricator.wikimedia.org/T366610 [13:17:37] T367270: Add rights for creation and edition of type converters (Z46 and Z46) - https://phabricator.wikimedia.org/T367270 [13:17:37] T365877: Community updates module: Title & Body text - https://phabricator.wikimedia.org/T365877 [13:18:52] !log jiji@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kubemaster[1001-1002].eqiad.wmnet decommissioned, removing all IPs 
except the asset tag one - jiji@cumin1002" [13:19:15] (03PS3) 10Volans: Images: take advantage of performance optimization [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1051298 [13:19:15] (03PS4) 10Volans: Images: automatically migrate to a new OS [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1051299 (https://phabricator.wikimedia.org/T368744) [13:19:49] (03CR) 10Volans: "If we decide to go with this approach we can add the tests." [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1051299 (https://phabricator.wikimedia.org/T368744) (owner: 10Volans) [13:20:03] !log lucaswerkmeister-wmde@deploy1002 sgimeno, jforrester, lucaswerkmeister-wmde: Backport for [[gerrit:1048059|[wikifunctions] Grant wikifunctions-staff enum and converter rights (T366610 T367270)]], [[gerrit:1047961|GrowthExperiments: add community updates module flag (T365877)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:20:28] (03PS8) 10Elukey: Switch php7.4-cli to bullseye and cascade [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1021922 (https://phabricator.wikimedia.org/T356293) (owner: 10Jforrester) [13:21:25] !log jiji@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kubemaster[1001-1002].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1002" [13:21:25] !log jiji@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:21:25] FIRING: [7x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:21:26] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts kubemaster[1001-1002].eqiad.wmnet [13:21:52] James_F: want to test the permission 
changes? [13:21:56] Lucas_WMDE: Sure. [13:21:57] https://www.wikifunctions.org/w/api.php?action=query&meta=siteinfo&siprop=usergroups|restrictions&format=json&formatversion=2 looks good to me, at least [13:22:33] (03CR) 10Elukey: "James I took the liberty to rebase and modify again the versions, IIUC from Joe the -sX suffix was only for security releases/concerns, so" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1021922 (https://phabricator.wikimedia.org/T356293) (owner: 10Jforrester) [13:22:38] Lucas_WMDE: LGTM. [13:22:47] !log lucaswerkmeister-wmde@deploy1002 sgimeno, jforrester, lucaswerkmeister-wmde: Continuing with sync [13:22:51] !log homer 'cr*codfw*' commit 'T351074' [13:22:52] alright, thanks for testing! [13:22:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:53] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [13:23:23] (03PS1) 10Peter Fischer: Search update pipeline: reduce client-side rate-limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051358 (https://phabricator.wikimedia.org/T362310) [13:23:36] (03CR) 10Jforrester: "> James I took the liberty to rebase and modify again the versions, IIUC from Joe the -sX suffix was only for security releases/concerns, " [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1021922 (https://phabricator.wikimedia.org/T356293) (owner: 10Jforrester) [13:26:24] (03PS2) 10Effie Mouzeli: kubernetes: retire kubemaster100[1-2] in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1051323 (https://phabricator.wikimedia.org/T353464) [13:27:54] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1048059|[wikifunctions] Grant wikifunctions-staff enum and converter rights (T366610 T367270)]], [[gerrit:1047961|GrowthExperiments: add community updates module flag (T365877)]] (duration: 10m 22s) [13:28:00] T366610: Restrict creation of instances of Types with 
identity keys to wikilambda-create-enum-value - https://phabricator.wikimedia.org/T366610 [13:28:00] T367270: Add rights for creation and edition of type converters (Z46 and Z46) - https://phabricator.wikimedia.org/T367270 [13:28:02] T365877: Community updates module: Title & Body text - https://phabricator.wikimedia.org/T365877 [13:29:11] (03CR) 10Effie Mouzeli: [C:03+2] kubernetes: retire kubemaster100[1-2] in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1051323 (https://phabricator.wikimedia.org/T353464) (owner: 10Effie Mouzeli) [13:29:24] (03PS1) 10Jgiannelos: Remove page html endpoints from changeprop [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051361 [13:29:53] James_F, urbanecm: should be deployed now [13:29:58] thanks [13:30:08] (well, whenever beta next runs a config update, I guess ^^) [13:30:23] (03PS2) 10Jgiannelos: Remove page html endpoints from changeprop [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051361 (https://phabricator.wikimedia.org/T367418) [13:30:27] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 15), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9944546 (10xcollazo) >>! In T361835#9943486, @mforns wrote: > ... >> The servi... 
[13:30:29] !log UTC afternoon backport+config window done [13:30:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T367856)', diff saved to https://phabricator.wikimedia.org/P65664 and previous config saved to /var/cache/conftool/dbconfig/20240702-133038-marostegui.json [13:30:41] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance [13:30:43] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [13:30:54] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance [13:31:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1222 (T367856)', diff saved to https://phabricator.wikimedia.org/P65665 and previous config saved to /var/cache/conftool/dbconfig/20240702-133100-marostegui.json [13:31:25] FIRING: [16x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:33:32] (03CR) 10Herron: [C:03+2] thanos: increase query frontend and store cache sizes [puppet] - 10https://gerrit.wikimedia.org/r/1051177 (https://phabricator.wikimedia.org/T368953) (owner: 10Herron) [13:33:39] (03PS3) 10Jgiannelos: Remove page html endpoints from changeprop [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051361 (https://phabricator.wikimedia.org/T367418) [13:35:29] !log Pooling and uncordoning wikikube-worker2030.codfw.wmnet wikikube-worker2031.codfw.wmnet wikikube-worker2032.codfw.wmnet wikikube-worker2033.codfw.wmnet wikikube-worker2034.codfw.wmnet - T351074 [13:35:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[13:35:32] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [13:35:51] (03PS4) 10Jgiannelos: Remove page html endpoints from changeprop [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051361 (https://phabricator.wikimedia.org/T367418) [13:35:59] !log cgoubert@cumin1002 conftool action : set/weight=10:pooled=yes; selector: name=(wikikube-worker2030.codfw.wmnet|wikikube-worker2031.codfw.wmnet|wikikube-worker2032.codfw.wmnet|wikikube-worker2033.codfw.wmnet|wikikube-worker2034.codfw.wmnet),cluster=kubernetes,service=kubesvc [13:36:25] FIRING: [15x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:39:16] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:39:29] !log brouberol@cumin1002 END (PASS) - Cookbook sre.druid.reboot-workers (exit_code=0) for Druid public cluster: Reboot Druid nodes [13:39:32] (03PS1) 10Jforrester: Revert "wikifunctions: Upgrade orchestrator from 2024-06-11-223956 to 2024-06-17-221517" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051364 (https://phabricator.wikimedia.org/T368892) [13:39:48] (03PS2) 10Jforrester: Revert "wikifunctions: Upgrade orchestrator from 2024-06-11-223956 to 2024-06-17-221517" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051364 (https://phabricator.wikimedia.org/T368892) [13:39:58] jouncebot: nowandnext [13:39:59] For the next 0 hour(s) and 20 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240702T1300) [13:39:59] In 1 hour(s) and 20 minute(s): 
SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240702T1500) [13:40:08] (03CR) 10Jforrester: [C:03+2] Revert "wikifunctions: Upgrade orchestrator from 2024-06-11-223956 to 2024-06-17-221517" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051364 (https://phabricator.wikimedia.org/T368892) (owner: 10Jforrester) [13:41:03] (03Merged) 10jenkins-bot: Revert "wikifunctions: Upgrade orchestrator from 2024-06-11-223956 to 2024-06-17-221517" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051364 (https://phabricator.wikimedia.org/T368892) (owner: 10Jforrester) [13:41:24] !log brouberol@cumin1002 START - Cookbook sre.druid.reboot-workers for Druid analytics cluster: Reboot Druid nodes [13:41:25] FIRING: [16x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:42:19] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [13:42:46] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [13:42:51] (03PS1) 10Effie Mouzeli: cumin: fix kube-master aliases [puppet] - 10https://gerrit.wikimedia.org/r/1051365 (https://phabricator.wikimedia.org/T353464) [13:43:08] !log jforrester@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [13:43:12] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:44:02] (03CR) 10Aqu: [C:03+1] "Thanks. Looks good." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050596 (https://phabricator.wikimedia.org/T367134) (owner: 10TChin) [13:44:19] !log jforrester@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [13:44:56] !log jforrester@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [13:45:27] (03CR) 10JHathaway: [C:03+1] "lgtm" [software/bitu] - 10https://gerrit.wikimedia.org/r/1051293 (https://phabricator.wikimedia.org/T366525) (owner: 10Slyngshede) [13:45:48] (03CR) 10Jgiannelos: "This is a bit tricky. The part were we remove the `exec` parts that send requests is straightforward. What I am not very confident is the " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051361 (https://phabricator.wikimedia.org/T367418) (owner: 10Jgiannelos) [13:46:00] (03PS1) 10Ayounsi: DHCP: Add support for routed ganeti subnets [puppet] - 10https://gerrit.wikimedia.org/r/1051366 (https://phabricator.wikimedia.org/T300152) [13:46:03] !log jforrester@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [13:46:14] PROBLEM - Host an-druid1001 is DOWN: PING CRITICAL - Packet loss = 100% [13:46:16] RECOVERY - Host an-druid1001 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [13:46:19] (03CR) 10CI reject: [V:04-1] cumin: fix kube-master aliases [puppet] - 10https://gerrit.wikimedia.org/r/1051365 (https://phabricator.wikimedia.org/T353464) (owner: 10Effie Mouzeli) [13:46:25] FIRING: [17x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:46:53] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1051366 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [13:47:05] (03CR) 10JHathaway: [C:03+1] profile::puppetmaster::updatenetboot: update help and install targets 
[puppet] - 10https://gerrit.wikimedia.org/r/1051317 (owner: 10Elukey) [13:47:25] (03PS2) 10Effie Mouzeli: cumin: fix kube-master aliases [puppet] - 10https://gerrit.wikimedia.org/r/1051365 (https://phabricator.wikimedia.org/T353464) [13:47:55] (03PS2) 10Ayounsi: DHCP: Add support for routed ganeti subnets [puppet] - 10https://gerrit.wikimedia.org/r/1051366 (https://phabricator.wikimedia.org/T300152) [13:48:01] (03PS1) 10Bking: wdqs: detune blackbox checks [puppet] - 10https://gerrit.wikimedia.org/r/1051369 (https://phabricator.wikimedia.org/T366405) [13:49:00] (03CR) 10Elukey: [C:03+2] profile::puppetmaster::updatenetboot: update help and install targets [puppet] - 10https://gerrit.wikimedia.org/r/1051317 (owner: 10Elukey) [13:49:58] (03PS1) 10Ssingh: varnish: make donate.m redirect permanent and add tests [puppet] - 10https://gerrit.wikimedia.org/r/1051370 (https://phabricator.wikimedia.org/T368645) [13:50:35] (03CR) 10Filippo Giunchedi: [C:03+2] wikifeeds: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043078 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [13:50:40] (03CR) 10Filippo Giunchedi: [C:03+2] shellboxen: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043085 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [13:50:43] (03CR) 10Filippo Giunchedi: [C:03+2] page-analytics: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043077 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [13:51:25] FIRING: [17x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:51:42] (03CR) 10Elukey: [C:03+1] Cookbooks: fix Netbox 4 breaking changes [cookbooks] - 10https://gerrit.wikimedia.org/r/1050445 
(https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [13:51:44] (03PS3) 10Filippo Giunchedi: page-analytics: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043077 (https://phabricator.wikimedia.org/T320563) [13:51:49] (03PS1) 10Jforrester: wikifunctions: Re-apply "Upgrade orchestrator from 2024-06-11-223956 to 2024-06-17-221517" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051373 [13:51:54] jouncebot: now and next [13:51:54] For the next 0 hour(s) and 8 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240702T1300) [13:51:54] (03CR) 10Jforrester: [C:03+2] wikifunctions: Re-apply "Upgrade orchestrator from 2024-06-11-223956 to 2024-06-17-221517" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051373 (owner: 10Jforrester) [13:52:32] (03PS1) 10Clément Goubert: mw-misc: deploy statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051367 (https://phabricator.wikimedia.org/T365265) [13:52:43] (03PS1) 10Clément Goubert: mw-wikifunctions: deploy statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051368 (https://phabricator.wikimedia.org/T365265) [13:53:12] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:55:25] (03CR) 10Ssingh: [V:03+1 C:03+2] hiera: dns6001: reduce anycast_hc logging level and backups [puppet] - 10https://gerrit.wikimedia.org/r/1050626 (owner: 10Ssingh) [13:56:05] (03PS3) 10Arnaudb: mysql: pt-heartbeat alerting rules [alerts] - 10https://gerrit.wikimedia.org/r/1047983 (https://phabricator.wikimedia.org/T367278) [13:56:25] FIRING: [5x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:56:38] (03CR) 10Arnaudb: mysql: pt-heartbeat alerting rules (034 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1047983 (https://phabricator.wikimedia.org/T367278) (owner: 10Arnaudb) [13:56:46] (03CR) 10Elukey: [C:03+1] "Very nice! TIL managers in django :)" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1051298 (owner: 10Volans) [13:57:17] (03CR) 10CI reject: [V:04-1] mysql: pt-heartbeat alerting rules [alerts] - 10https://gerrit.wikimedia.org/r/1047983 (https://phabricator.wikimedia.org/T367278) (owner: 10Arnaudb) [13:58:46] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977#9944795 (10cmooney) [13:58:47] !log decom old eqiad and codfw kubetcd hosts [13:58:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:42] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 145 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [13:59:59] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e1-eqiad - https://phabricator.wikimedia.org/T365993#9944811 (10cmooney) [14:00:32] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1051348 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur) [14:01:25] FIRING: [6x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:01:27] (03CR) 10CI reject: [V:04-1] wikifeeds: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043078 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [14:01:27] (03CR) 10CI reject: [V:04-1] shellboxen: 
enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043085 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [14:01:51] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=dns6001.wikimedia.org [14:01:56] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] page-analytics: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043077 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [14:02:02] !log restart anycast-hc on dns6001 [14:02:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:19] (03Merged) 10jenkins-bot: wikifunctions: Re-apply "Upgrade orchestrator from 2024-06-11-223956 to 2024-06-17-221517" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051373 (owner: 10Jforrester) [14:02:35] (03CR) 10Elukey: "Looks really great, have you tried to dry-run it via test-cookbook to double check that everything looks good? (see https://wikitech.wikim" [cookbooks] - 10https://gerrit.wikimedia.org/r/1049950 (owner: 10Ssingh) [14:03:04] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=dns6001.wikimedia.org [14:03:07] (03PS4) 10Arnaudb: mysql: pt-heartbeat alerting rules [alerts] - 10https://gerrit.wikimedia.org/r/1047983 (https://phabricator.wikimedia.org/T367278) [14:03:08] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:03:48] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:04:05] !log jforrester@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:04:08] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=dns6001.wikimedia.org,service=recdns [14:04:55] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in search_eqiad [14:04:57] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts 
in search_eqiad [14:04:59] !log filippo@deploy1002 helmfile [codfw] START helmfile.d/services/page-analytics: apply [14:05:21] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1051369 (https://phabricator.wikimedia.org/T366405) (owner: 10Bking) [14:05:25] (03PS1) 10David Caro: replica_cnf: add tools to url [puppet] - 10https://gerrit.wikimedia.org/r/1051375 (https://phabricator.wikimedia.org/T368909) [14:05:30] !log filippo@deploy1002 helmfile [codfw] DONE helmfile.d/services/page-analytics: apply [14:05:45] !log jforrester@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:05:53] !log jforrester@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:05:58] (03CR) 10Volans: [C:03+2] Images: take advantage of performance optimization [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1051298 (owner: 10Volans) [14:05:59] !log filippo@deploy1002 helmfile [eqiad] START helmfile.d/services/page-analytics: apply [14:06:18] !log filippo@deploy1002 helmfile [eqiad] DONE helmfile.d/services/page-analytics: apply [14:06:20] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=dns6001.wikimedia.org,service=recdns [14:06:25] FIRING: [6x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:06:49] (03PS2) 10David Caro: replica_cnf: add tools to url [puppet] - 10https://gerrit.wikimedia.org/r/1051375 (https://phabricator.wikimedia.org/T368909) [14:06:49] (03PS1) 10David Caro: ceph: update the cloudcephosd1008 iface names [puppet] - 10https://gerrit.wikimedia.org/r/1051376 (https://phabricator.wikimedia.org/T348643) [14:07:01] !log jforrester@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:07:30] (03CR) 10David Caro: [C:03+2] 
ceph: update the cloudcephosd1008 iface names [puppet] - 10https://gerrit.wikimedia.org/r/1051376 (https://phabricator.wikimedia.org/T348643) (owner: 10David Caro) [14:07:34] (03PS1) 10Ssingh: Revert "hiera: dns6001: reduce anycast_hc logging level and backups" [puppet] - 10https://gerrit.wikimedia.org/r/1051377 [14:10:10] (03CR) 10Ssingh: [C:03+2] Revert "hiera: dns6001: reduce anycast_hc logging level and backups" [puppet] - 10https://gerrit.wikimedia.org/r/1051377 (owner: 10Ssingh) [14:10:17] (03CR) 10CI reject: [V:04-1] replica_cnf: add tools to url [puppet] - 10https://gerrit.wikimedia.org/r/1051375 (https://phabricator.wikimedia.org/T368909) (owner: 10David Caro) [14:11:07] (03PS2) 10Bking: wdqs: detune blackbox checks [puppet] - 10https://gerrit.wikimedia.org/r/1051369 (https://phabricator.wikimedia.org/T366405) [14:11:11] (03Merged) 10jenkins-bot: Images: take advantage of performance optimization [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1051298 (owner: 10Volans) [14:11:14] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1051369 (https://phabricator.wikimedia.org/T366405) (owner: 10Bking) [14:11:27] !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 6 hosts with reason: decom [14:11:28] !log jiji@cumin1002 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 2 days, 0:00:00 on 6 hosts with reason: decom [14:12:12] !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 6 hosts with reason: decom [14:12:19] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 6 hosts with reason: decom [14:12:46] (03PS1) 10Elukey: docker::reporter: update exclude rules [puppet] - 10https://gerrit.wikimedia.org/r/1051379 (https://phabricator.wikimedia.org/T367427) [14:13:39] (03CR) 10Ssingh: "Thanks! I did test-cookbook -c 1049950 --dry-run sre.dns.roll-restart-ntp --reason 'testing dry run' --alias dnsbox restart_daemons. 
Outpu" [cookbooks] - 10https://gerrit.wikimedia.org/r/1049950 (owner: 10Ssingh) [14:13:50] (03PS2) 10Elukey: docker::reporter: update exclude rules [puppet] - 10https://gerrit.wikimedia.org/r/1051379 (https://phabricator.wikimedia.org/T367427) [14:15:12] (03CR) 10Elukey: [C:03+1] cookbooks/sre/dns: add a cookbook for roll restart of ntpd.service [cookbooks] - 10https://gerrit.wikimedia.org/r/1049950 (owner: 10Ssingh) [14:15:49] !log root@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1008.eqiad.wmnet with OS bullseye [14:15:56] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad - https://phabricator.wikimedia.org/T365994#9944984 (10cmooney) [14:16:10] (03PS3) 10David Caro: replica_cnf: add tools to url [puppet] - 10https://gerrit.wikimedia.org/r/1051375 (https://phabricator.wikimedia.org/T368909) [14:16:58] (03PS1) 10Effie Mouzeli: Remove kubetcd100 from etcd SRV records [dns] - 10https://gerrit.wikimedia.org/r/1051380 (https://phabricator.wikimedia.org/T353464) [14:19:31] !log dcaro@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1008.eqiad.wmnet [14:19:49] (03PS1) 10Ssingh: dnsbox: set anycast-hc num_backups to one [puppet] - 10https://gerrit.wikimedia.org/r/1051381 [14:20:11] (03CR) 10Ssingh: "Thanks for the review!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1049950 (owner: 10Ssingh) [14:21:04] (03PS2) 10Ssingh: dnsbox: set anycast-hc num_backups to one [puppet] - 10https://gerrit.wikimedia.org/r/1051381 [14:21:39] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3138/co" [puppet] - 10https://gerrit.wikimedia.org/r/1051381 (owner: 10Ssingh) [14:22:28] (03PS6) 10Arturo Borrero Gonzalez: openstack: nova-compute: remove support for legacy NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/1049200 (https://phabricator.wikimedia.org/T319184) [14:22:38] (03CR) 10Arnaudb: [C:03+1] mariadb: enable pt-heartbeat monitoring through mysqld-exporter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1048006 (https://phabricator.wikimedia.org/T367278) (owner: 10Arnaudb) [14:23:12] (03PS2) 10Effie Mouzeli: Remove kubetcd* from etcd SRV records (eqiad+codfw) [dns] - 10https://gerrit.wikimedia.org/r/1051380 (https://phabricator.wikimedia.org/T353464) [14:23:50] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Grant Access to analytics-privatedata-users for cwylo - https://phabricator.wikimedia.org/T368027#9945028 (10Volans) [14:25:13] (03CR) 10David Caro: "Tested in toolsbeta:" [puppet] - 10https://gerrit.wikimedia.org/r/1051375 (https://phabricator.wikimedia.org/T368909) (owner: 10David Caro) [14:25:17] (03CR) 10Ssingh: [C:03+2] cookbooks/sre/dns: add a cookbook for roll restart of ntpd.service [cookbooks] - 10https://gerrit.wikimedia.org/r/1049950 (owner: 10Ssingh) [14:26:06] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] openstack: nova-compute: remove support for legacy NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/1049200 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [14:26:21] (03CR) 10Kamila Součková: [C:03+1] Remove kubetcd* from etcd SRV records (eqiad+codfw) [dns] - 10https://gerrit.wikimedia.org/r/1051380 
(https://phabricator.wikimedia.org/T353464) (owner: 10Effie Mouzeli) [14:26:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:28:26] !log dcaro@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1008.eqiad.wmnet [14:29:28] (03CR) 10David Caro: [C:03+2] "Passing in tools too:" [puppet] - 10https://gerrit.wikimedia.org/r/1051375 (https://phabricator.wikimedia.org/T368909) (owner: 10David Caro) [14:30:04] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9945048 (10aborrero) 05Open→03Stalled marking as stalled, because the work on ceph nodes won't be progressing for a while. [14:32:53] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9945091 (10CDanis) >>! In T348643#9931318, @dcaro wrote: > Any ideas/recommendations on how to proceed next? > > I... 
[14:34:04] (03CR) 10Effie Mouzeli: [C:03+2] Remove kubetcd* from etcd SRV records (eqiad+codfw) [dns] - 10https://gerrit.wikimedia.org/r/1051380 (https://phabricator.wikimedia.org/T353464) (owner: 10Effie Mouzeli) [14:35:09] (03PS3) 10Clément Goubert: P:kubernetes:node: Autorestart ferm.service [puppet] - 10https://gerrit.wikimedia.org/r/1051378 (https://phabricator.wikimedia.org/T354855) [14:35:49] (03PS1) 10Volans: admin: add cwylo to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1051385 (https://phabricator.wikimedia.org/T368027) [14:36:44] (03CR) 10CI reject: [V:04-1] admin: add cwylo to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1051385 (https://phabricator.wikimedia.org/T368027) (owner: 10Volans) [14:37:02] (03PS1) 10Elukey: knative: upgrade all images to Bullseye and Golang 1.19 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1051387 (https://phabricator.wikimedia.org/T368359) [14:37:21] (03CR) 10Elukey: "== Step 0: scanning /home/elukey/Wikimedia/production-images/images/ ==" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1051387 (https://phabricator.wikimedia.org/T368359) (owner: 10Elukey) [14:37:41] !log jiji@cumin1002 START - Cookbook sre.hosts.decommission for hosts kubetcd[1004-1006].eqiad.wmnet [14:38:08] !log ayounsi@cumin1002 START - Cookbook sre.hosts.reimage for host testvm2007.codfw.wmnet with OS bookworm [14:39:13] (03PS2) 10Volans: admin: add cwylo to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1051385 (https://phabricator.wikimedia.org/T368027) [14:39:16] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:41:08] (03CR) 10Alexandros Kosiaris: [C:03+1] mw-wikifunctions: deploy statsd-exporter 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1051368 (https://phabricator.wikimedia.org/T365265) (owner: 10Clément Goubert) [14:41:30] (03CR) 10Alexandros Kosiaris: [C:03+1] mw-misc: deploy statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051367 (https://phabricator.wikimedia.org/T365265) (owner: 10Clément Goubert) [14:43:49] jouncebot: nowandnext [14:43:49] No deployments scheduled for the next 0 hour(s) and 16 minute(s) [14:43:49] In 0 hour(s) and 16 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240702T1500) [14:43:55] (03CR) 10Clément Goubert: [C:03+2] mw-misc: deploy statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051367 (https://phabricator.wikimedia.org/T365265) (owner: 10Clément Goubert) [14:44:05] (03CR) 10Clément Goubert: [C:03+2] mw-wikifunctions: deploy statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051368 (https://phabricator.wikimedia.org/T365265) (owner: 10Clément Goubert) [14:44:12] (03PS2) 10Arnaudb: mariadb: installs backport mysqld-exporter on deb11 [puppet] - 10https://gerrit.wikimedia.org/r/1051388 (https://phabricator.wikimedia.org/T367278) [14:45:02] (03Merged) 10jenkins-bot: mw-misc: deploy statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051367 (https://phabricator.wikimedia.org/T365265) (owner: 10Clément Goubert) [14:45:04] (03Merged) 10jenkins-bot: mw-wikifunctions: deploy statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051368 (https://phabricator.wikimedia.org/T365265) (owner: 10Clément Goubert) [14:45:09] (03CR) 10Arnaudb: [C:03+1] mariadb: enable pt-heartbeat monitoring through mysqld-exporter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1048006 (https://phabricator.wikimedia.org/T367278) (owner: 10Arnaudb) [14:45:33] !log jiji@cumin1002 START - Cookbook sre.dns.netbox [14:46:25] FIRING: [2x] SystemdUnitFailed: 
docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:47:50] !log jiji@cumin1002 START - Cookbook sre.hosts.decommission for hosts kubetcd[2004-2006].codfw.wmnet [14:48:03] (03CR) 10Fabfur: [C:03+2] hiera: upgrade haproxy to 2.8 on esams [puppet] - 10https://gerrit.wikimedia.org/r/1051348 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur) [14:48:12] !log jiji@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kubetcd[1004-1006].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1002" [14:49:33] (03CR) 10Ssingh: [C:03+1] admin: add cwylo to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1051385 (https://phabricator.wikimedia.org/T368027) (owner: 10Volans) [14:49:48] (03CR) 10Ayounsi: [C:03+1] admin: add cwylo to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1051385 (https://phabricator.wikimedia.org/T368027) (owner: 10Volans) [14:50:25] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-wikifunctions: apply [14:51:14] !log jiji@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kubetcd[1004-1006].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1002" [14:51:14] !log jiji@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:51:15] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts kubetcd[1004-1006].eqiad.wmnet [14:51:48] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Get test host connected to codfw row c/d lsw's - https://phabricator.wikimedia.org/T367512#9945244 (10Jhancock.wm) @cmooney got 
sretest2002 on lsw-d4, ports 44 and 45. 10G card. [14:52:09] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-wikifunctions: apply [14:52:20] PROBLEM - Check whether ferm is active by checking the default input chain on wikikube-ctrl1001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:52:35] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-wikifunctions: apply [14:52:43] !log brouberol@cumin1002 END (PASS) - Cookbook sre.druid.reboot-workers (exit_code=0) for Druid analytics cluster: Reboot Druid nodes [14:52:58] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-wikifunctions: apply [14:53:26] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-misc: apply [14:53:48] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-misc: apply [14:53:54] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-misc: apply [14:54:31] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167#9945268 (10Papaul) @elukey we will work on this more tomorrow during the meeting . 
Thanks [14:54:33] (03CR) 10Volans: [C:03+2] admin: add cwylo to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1051385 (https://phabricator.wikimedia.org/T368027) (owner: 10Volans) [14:55:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T364069)', diff saved to https://phabricator.wikimedia.org/P65666 and previous config saved to /var/cache/conftool/dbconfig/20240702-145542-marostegui.json [14:55:45] !log upgrading A:cp-esams to haproxy 2.8.10 (T367756) [14:55:46] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [14:55:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:48] T367756: Upgrade hosts to haproxy 2.8.10 - https://phabricator.wikimedia.org/T367756 [14:55:55] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_esams [14:55:57] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_esams [14:56:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:56:54] (03CR) 10Dzahn: "Wanna share what the actual error is? We had similar cases that turned out to be legit things that can be fixed." 
[puppet] - 10https://gerrit.wikimedia.org/r/1051369 (https://phabricator.wikimedia.org/T366405) (owner: 10Bking) [14:57:54] (03CR) 10Filippo Giunchedi: "LGTM, though see inline for some considerations" [puppet] - 10https://gerrit.wikimedia.org/r/1051388 (https://phabricator.wikimedia.org/T367278) (owner: 10Arnaudb) [14:58:05] !log ayounsi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on testvm2007.codfw.wmnet with reason: host reimage [14:58:14] !log jiji@cumin1002 START - Cookbook sre.dns.netbox [14:58:28] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Get test host connected to codfw row c/d lsw's - https://phabricator.wikimedia.org/T367512#9945280 (10cmooney) >>! In T367512#9945244, @Jhancock.wm wrote: > @cmooney got sretest2002 on lsw-d4, ports 44 and 45. 10G card. Awesome thank... [14:58:36] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-misc: apply [14:59:16] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:59:28] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Grant Access to analytics-privatedata-users for cwylo - https://phabricator.wikimedia.org/T368027#9945273 (10Volans) 05In progress→03Resolved @cwylo this is now done, I'm resolving the task. Within 30 minutes the change should be... [14:59:42] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 146 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [15:00:05] eoghan, jelto, arnoldokoth, and mutante: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for SRE Collaboration Services office hours deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240702T1500). 
[15:00:47] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9945281 (10dcaro) >>! In T348643#9945091, @CDanis wrote: >>>! In T348643#9931318, @dcaro wrote: >> Any ideas/recomme... [15:00:48] 10ops-codfw, 06SRE, 06DC-Ops: Cabling for FR - https://phabricator.wikimedia.org/T368940#9945284 (10Papaul) [15:02:07] !log jiji@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kubetcd[2004-2006].codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1002" [15:02:38] 10ops-codfw, 06SRE, 06DC-Ops: Cabling for FR - https://phabricator.wikimedia.org/T368940#9945286 (10Papaul) [15:03:04] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on testvm2007.codfw.wmnet with reason: host reimage [15:05:07] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9945304 (10CDanis) Yeah okay, that's all pretty messy to potentially clean up from. Have you tried the `ceph-syn` t... 
[15:05:54] !log jiji@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kubetcd[2004-2006].codfw.wmnet decommissioned, removing all IPs except the asset tag one - jiji@cumin1002" [15:05:54] !log jiji@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:05:55] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts kubetcd[2004-2006].codfw.wmnet [15:06:25] RESOLVED: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:08:14] PROBLEM - Check whether ferm is active by checking the default input chain on wikikube-ctrl2001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:09:16] (03CR) 10Kamila Součková: [C:03+1] P:kubernetes:node: Autorestart ferm.service [puppet] - 10https://gerrit.wikimedia.org/r/1051378 (https://phabricator.wikimedia.org/T354855) (owner: 10Clément Goubert) [15:09:31] (03CR) 10Elukey: [C:03+2] services: update thumbor-plugin Docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051150 (owner: 10Elukey) [15:10:11] (03CR) 10Clément Goubert: [C:03+2] P:kubernetes:node: Autorestart ferm.service [puppet] - 10https://gerrit.wikimedia.org/r/1051378 (https://phabricator.wikimedia.org/T354855) (owner: 10Clément Goubert) [15:10:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P65667 and previous config saved to /var/cache/conftool/dbconfig/20240702-151050-marostegui.json [15:11:25] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:12:12] (03PS1) 10Alexandros Kosiaris: deploy1003: Comment them out from scap_masters [puppet] - 10https://gerrit.wikimedia.org/r/1051392 (https://phabricator.wikimedia.org/T364417) [15:12:22] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Add per-output queue monitoring for Juniper network devices - https://phabricator.wikimedia.org/T326322#9945320 (10cmooney) All seems ok following the increase: {F56173453 width=500} FWIW the scraping is now taking longer, indicating that... [15:12:47] (03PS1) 10Arturo Borrero Gonzalez: kubeadm: core: remove deprecated pre-defined kubelet flags [puppet] - 10https://gerrit.wikimedia.org/r/1051393 (https://phabricator.wikimedia.org/T355881) [15:12:49] !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync [15:12:54] !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync [15:13:09] (03CR) 10CI reject: [V:04-1] kubeadm: core: remove deprecated pre-defined kubelet flags [puppet] - 10https://gerrit.wikimedia.org/r/1051393 (https://phabricator.wikimedia.org/T355881) (owner: 10Arturo Borrero Gonzalez) [15:13:26] (03CR) 10Ahmon Dancy: [C:03+1] deploy1003: Comment them out from scap_masters [puppet] - 10https://gerrit.wikimedia.org/r/1051392 (https://phabricator.wikimedia.org/T364417) (owner: 10Alexandros Kosiaris) [15:13:48] (03CR) 10Clément Goubert: [C:03+2] mw-web: Add traindev environment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047158 (owner: 10Ahmon Dancy) [15:14:42] (03Merged) 10jenkins-bot: mw-web: Add traindev environment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047158 (owner: 10Ahmon Dancy) [15:15:25] (03PS2) 10Arturo Borrero Gonzalez: kubeadm: core: remove deprecated pre-defined kubelet flags [puppet] - 
10https://gerrit.wikimedia.org/r/1051393 (https://phabricator.wikimedia.org/T355881) [15:16:16] (03CR) 10Alexandros Kosiaris: [C:03+2] deploy1003: Comment them out from scap_masters [puppet] - 10https://gerrit.wikimedia.org/r/1051392 (https://phabricator.wikimedia.org/T364417) (owner: 10Alexandros Kosiaris) [15:16:18] (03PS3) 10Elukey: docker::reporter: update exclude rules [puppet] - 10https://gerrit.wikimedia.org/r/1051379 (https://phabricator.wikimedia.org/T367427) [15:16:25] RESOLVED: [2x] SystemdUnitFailed: ferm.service on wikikube-ctrl1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:16:54] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1051393 (https://phabricator.wikimedia.org/T355881) (owner: 10Arturo Borrero Gonzalez) [15:17:44] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host testvm2007.codfw.wmnet with OS bookworm [15:22:20] RECOVERY - Check whether ferm is active by checking the default input chain on wikikube-ctrl1001 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:23:20] (03PS1) 10Fabfur: hiera: consolidate haproxy version to 2.8 [puppet] - 10https://gerrit.wikimedia.org/r/1051394 (https://phabricator.wikimedia.org/T367756) [15:24:08] 06SRE, 10SRE-Access-Requests: Grant Access to analytics-privatedata-users for - https://phabricator.wikimedia.org/T368566#9945385 (10Sharvaniharan) Thank you @Ottomata and @Dzahn Should I be doing anything to get the analytics-privatedata-users access, or is this task sufficient? 
[15:24:28] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1051393 (https://phabricator.wikimedia.org/T355881) (owner: 10Arturo Borrero Gonzalez) [15:24:46] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1051394 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur) [15:25:56] (03PS20) 10Gergő Tisza: Handle sso.wikimedia.org domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) [15:25:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P65668 and previous config saved to /var/cache/conftool/dbconfig/20240702-152558-marostegui.json [15:26:10] (03CR) 10Gergő Tisza: Handle sso.wikimedia.org domain (034 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) (owner: 10Gergő Tisza) [15:30:25] (03PS1) 10Fabfur: hiera: consolidate haproxy version to 2.8 [puppet] - 10https://gerrit.wikimedia.org/r/1051395 (https://phabricator.wikimedia.org/T367756) [15:30:48] (03CR) 10CI reject: [V:04-1] hiera: consolidate haproxy version to 2.8 [puppet] - 10https://gerrit.wikimedia.org/r/1051395 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur) [15:32:06] (03CR) 10Vgutierrez: [C:04-1] "current approach requires LVS to be rebooted to be applied, some exec stanzas would be needed to enforce the change on run-puppet-agent ti" [puppet] - 10https://gerrit.wikimedia.org/r/1006063 (https://phabricator.wikimedia.org/T358260) (owner: 10Cathal Mooney) [15:33:18] (03CR) 10Vgutierrez: [C:03+1] hiera: consolidate haproxy version to 2.8 [puppet] - 10https://gerrit.wikimedia.org/r/1051394 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur) [15:33:48] (03PS1) 10Jdlrobson: Make Flow work in dark mode by disabling backgrounds and setting text [extensions/Flow] (wmf/1.43.0-wmf.11) - 
10https://gerrit.wikimedia.org/r/1051396 (https://phabricator.wikimedia.org/T357600) [15:35:02] (03PS7) 10Jdlrobson: [July 8th] Reduce list of exclusions for dark mode (1.43.0-wmf.12) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050671 (https://phabricator.wikimedia.org/T357575) [15:35:11] (03PS8) 10Jdlrobson: [July 8th] Reduce list of exclusions for dark mode (1.43.0-wmf.12) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050671 (https://phabricator.wikimedia.org/T357575) [15:35:34] (03PS9) 10Jdlrobson: [July 8th] Reduce list of exclusions for dark mode (1.43.0-wmf.12) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050671 (https://phabricator.wikimedia.org/T366368) [15:35:58] (03PS1) 10Fabfur: hiera: consolidate haproxy version to 2.8 [puppet] - 10https://gerrit.wikimedia.org/r/1051397 (https://phabricator.wikimedia.org/T367756) [15:36:02] (03PS10) 10Jdlrobson: [July 8th] Reduce list of exclusions for dark mode (1.43.0-wmf.12) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050671 (https://phabricator.wikimedia.org/T366368) [15:36:22] (03CR) 10CI reject: [V:04-1] hiera: consolidate haproxy version to 2.8 [puppet] - 10https://gerrit.wikimedia.org/r/1051397 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur) [15:38:14] RECOVERY - Check whether ferm is active by checking the default input chain on wikikube-ctrl2001 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:38:17] (03Abandoned) 10Fabfur: hiera: consolidate haproxy version to 2.8 [puppet] - 10https://gerrit.wikimedia.org/r/1051395 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur) [15:38:23] (03PS1) 10Brouberol: Superset: upgrade Superset to version 4.0.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051398 (https://phabricator.wikimedia.org/T366060) [15:38:34] (03CR) 10Fabfur: [C:03+2] hiera: consolidate haproxy version to 2.8 [puppet] - 10https://gerrit.wikimedia.org/r/1051394 
(https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur) [15:39:55] 06SRE, 10Thumbor: wikipedia-pl-sysop: local images fail to generate thumbnail - https://phabricator.wikimedia.org/T368945#9945465 (10Volans) [15:40:04] (03PS3) 10Arturo Borrero Gonzalez: kubeadm: core: remove deprecated pre-defined kubelet flags [puppet] - 10https://gerrit.wikimedia.org/r/1051393 (https://phabricator.wikimedia.org/T355881) [15:41:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T364069)', diff saved to https://phabricator.wikimedia.org/P65669 and previous config saved to /var/cache/conftool/dbconfig/20240702-154105-marostegui.json [15:41:07] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2176.codfw.wmnet with reason: Maintenance [15:41:13] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [15:41:20] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2176.codfw.wmnet with reason: Maintenance [15:41:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2176 (T364069)', diff saved to https://phabricator.wikimedia.org/P65670 and previous config saved to /var/cache/conftool/dbconfig/20240702-154127-marostegui.json [15:42:33] FIRING: KubernetesCalicoDown: kubernetes1051.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=kubernetes1051.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:43:18] !log ayounsi@cumin1002 START - Cookbook sre.hosts.reimage for host testvm2007.codfw.wmnet with OS bookworm [15:44:43] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 20:00:00 on kubernetes1051.eqiad.wmnet with reason: Hardware issue [15:44:55] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, 06serviceops: kubernetes1051.eqiad.wmnet failed to 
pull mediawiki images - https://phabricator.wikimedia.org/T369011#9945483 (10Clement_Goubert) Host is flapping, setting downtime until tomorrow [15:44:57] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on kubernetes1051.eqiad.wmnet with reason: Hardware issue [15:45:02] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, 06serviceops: kubernetes1051.eqiad.wmnet failed to pull mediawiki images - https://phabricator.wikimedia.org/T369011#9945484 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=1d5196ee-59a9-4e12-b2fc-c8c25de6ab16) set by cgoubert@cumin1002... [15:45:31] (03PS1) 10Elukey: wmfdebug: Upgrade to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1051402 (https://phabricator.wikimedia.org/T368366) [15:45:43] (03CR) 10Elukey: "== Step 0: scanning /home/elukey/Wikimedia/production-images/images/ ==" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1051402 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey) [15:46:06] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_esams [15:47:29] (03CR) 10Btullis: [C:03+1] "Nice. Remember that we still have to do a manual `superset db migrate` and a `superset init` once the new version is deployed." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1051398 (https://phabricator.wikimedia.org/T366060) (owner: 10Brouberol) [15:48:17] (03CR) 10Brouberol: [C:03+2] Superset: upgrade Superset to version 4.0.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051398 (https://phabricator.wikimedia.org/T366060) (owner: 10Brouberol) [15:49:02] (03CR) 10DCausse: [C:03+1] Search update pipeline: reduce client-side rate-limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051358 (https://phabricator.wikimedia.org/T362310) (owner: 10Peter Fischer) [15:49:12] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_esams [15:50:00] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset: apply [15:50:28] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset: apply [15:51:02] 06SRE, 10SRE-Access-Requests: Grant Access to analytics-privatedata-users for - https://phabricator.wikimedia.org/T368566#9945499 (10Volans) @Sharvaniharan I'll re-purpose this task for the revised requirement, I'll let you know if any data is missing [15:51:32] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-master1004.eqiad.wmnet [15:52:18] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 15), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9945514 (10Scott_French) Thanks for the sample data, @xcollazo. Using the fir... 
[15:55:11] (03CR) 10Vgutierrez: [C:04-1] "tests aren't happy here:" [puppet] - 10https://gerrit.wikimedia.org/r/1051370 (https://phabricator.wikimedia.org/T368645) (owner: 10Ssingh) [15:55:56] 06SRE, 10SRE-Access-Requests: Grant Access to analytics-privatedata-users for Sharvaniharan - https://phabricator.wikimedia.org/T368566#9945529 (10Volans) p:05High→03Medium [15:56:35] (03CR) 10Vgutierrez: [C:04-1] varnish: make donate.m redirect permanent and add tests (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1051370 (https://phabricator.wikimedia.org/T368645) (owner: 10Ssingh) [15:57:57] !log ayounsi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on testvm2007.codfw.wmnet with reason: host reimage [15:58:05] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-master1004.eqiad.wmnet [16:00:04] jhathaway and rzl: It is that lovely time of the day again! You are hereby commanded to deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240702T1600). [16:00:04] tgr: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[16:00:11] (03PS1) 10Alexandros Kosiaris: Revert "Resurrect fluent-bit image" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1051407 (https://phabricator.wikimedia.org/T251812) [16:00:15] o/ [16:00:16] tgr|away: hi I was just looking at this :) [16:00:27] it is more complex than I am comfortable deploying in the puppet window, I think [16:00:43] but let me see if I can find a domain expert who's willing to shepherd it through for you [16:01:19] !log brett@cumin2002 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-text_drmrs [16:01:58] thanks rzl [16:02:44] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on testvm2007.codfw.wmnet with reason: host reimage [16:03:01] 06SRE, 10SRE-Access-Requests: Grant Access to analytics-privatedata-users for Sharvaniharan - https://phabricator.wikimedia.org/T368566#9945568 (10Volans) @Sharvaniharan please confirm to have read [[ https://wikitech.wikimedia.org/wiki/Analytics/Data_access#User_responsibilities | Analytics Data Access User R... [16:03:24] (03PS2) 10Ssingh: varnish: make donate.m redirect permanent and add tests [puppet] - 10https://gerrit.wikimedia.org/r/1051370 (https://phabricator.wikimedia.org/T368645) [16:03:30] (03CR) 10Ssingh: "Thanks, updated!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1051370 (https://phabricator.wikimedia.org/T368645) (owner: 10Ssingh) [16:04:42] (03PS5) 10Ayounsi: DHCP: send subnet-mask 255.255.255.255 for routed ganeti VMs [puppet] - 10https://gerrit.wikimedia.org/r/1051366 (https://phabricator.wikimedia.org/T300152) [16:04:42] (03CR) 10Ayounsi: [V:03+1] "PCC is happy and the change has been tested with vmtest2007" [puppet] - 10https://gerrit.wikimedia.org/r/1051366 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [16:05:50] (03CR) 10Vgutierrez: [C:04-1] varnish: make donate.m redirect permanent and add tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1051370 (https://phabricator.wikimedia.org/T368645) (owner: 10Ssingh) [16:05:51] (03PS7) 10Jdlrobson: [July 2nd] Mobile: Enable dark mode for all users for tier 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050085 (https://phabricator.wikimedia.org/T367151) [16:06:40] (03PS2) 10Alexandros Kosiaris: Revert "Resurrect fluent-bit image" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1051407 (https://phabricator.wikimedia.org/T251812) [16:07:30] (03PS1) 10CDanis: CHANGELOG for configuration 1.8.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051412 (https://phabricator.wikimedia.org/T362310) [16:09:25] (03PS1) 10Volans: admin: add sharvaniharan to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1051413 (https://phabricator.wikimedia.org/T368566) [16:09:45] (03CR) 10Volans: [C:04-1] "Pending approval on task" [puppet] - 10https://gerrit.wikimedia.org/r/1051413 (https://phabricator.wikimedia.org/T368566) (owner: 10Volans) [16:13:29] !log brett@cumin2002 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-upload_drmrs [16:13:57] (03PS1) 10Kamila Součková: benthos/mw_accesslog_metrics: Add buffer [puppet] - 10https://gerrit.wikimedia.org/r/1051415 (https://phabricator.wikimedia.org/T367076) [16:13:58] 10ops-eqiad, 06SRE, 10Cloud-VPS, 
06DC-Ops, 10cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9945668 (10dcaro) >>! In T348643#9945304, @CDanis wrote: > Yeah okay, that's all pretty messy to potentially clean u... [16:15:50] (03CR) 10Vgutierrez: [C:04-1] varnish: make donate.m redirect permanent and add tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1051370 (https://phabricator.wikimedia.org/T368645) (owner: 10Ssingh) [16:16:14] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host testvm2007.codfw.wmnet with OS bookworm [16:17:39] (03PS3) 10Ssingh: varnish: make donate.m redirect permanent and add tests [puppet] - 10https://gerrit.wikimedia.org/r/1051370 (https://phabricator.wikimedia.org/T368645) [16:17:50] (03CR) 10Ssingh: varnish: make donate.m redirect permanent and add tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1051370 (https://phabricator.wikimedia.org/T368645) (owner: 10Ssingh) [16:18:29] (03PS1) 10Dzahn: gerrit: remove NRPE process monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1032526 [16:18:29] (03CR) 10Dzahn: "@hashar - I'll just abandon this and keep it but I am still interested in the answer to that previous question. Do you have _actual_ mail " [puppet] - 10https://gerrit.wikimedia.org/r/1032526 (owner: 10Dzahn) [16:20:15] (03CR) 10Scott French: [C:03+1] "Thanks, Reuven!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1034633 (https://phabricator.wikimedia.org/T368966) (owner: 10RLazarus) [16:20:49] (03CR) 10Dzahn: admin: add approvers to group analytics-research-admins [puppet] - 10https://gerrit.wikimedia.org/r/1049239 (https://phabricator.wikimedia.org/T276465) (owner: 10Dzahn) [16:20:56] 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: add approvers to analytics-research-admins - https://phabricator.wikimedia.org/T368435#9945722 (10Dzahn) [16:21:13] (03CR) 10Dzahn: [C:04-1] "@volans thanks for handling the access requests so nicely this week. would you mind taking a look at this one too?" [puppet] - 10https://gerrit.wikimedia.org/r/1049239 (https://phabricator.wikimedia.org/T276465) (owner: 10Dzahn) [16:22:55] (03CR) 10Dzahn: [C:04-1] "The thing here is that the owners of the machine are data-engineering but research team uses them. So based on that, who should be the act" [puppet] - 10https://gerrit.wikimedia.org/r/1049239 (https://phabricator.wikimedia.org/T276465) (owner: 10Dzahn) [16:27:11] (03CR) 10Andrew Bogott: [C:03+2] Toolforge elasticsearch haproxy: update CORS syntax for modern haproxy [puppet] - 10https://gerrit.wikimedia.org/r/1051216 (https://phabricator.wikimedia.org/T311905) (owner: 10Andrew Bogott) [16:27:11] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:27:29] (03CR) 10Vgutierrez: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1051370 (https://phabricator.wikimedia.org/T368645) (owner: 10Ssingh) [16:27:58] (03CR) 10Ssingh: "Thanks for the review!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1051370 (https://phabricator.wikimedia.org/T368645) (owner: 10Ssingh) [16:28:01] (03CR) 10Ssingh: [C:03+2] varnish: make donate.m redirect permanent and add tests [puppet] - 10https://gerrit.wikimedia.org/r/1051370 (https://phabricator.wikimedia.org/T368645) (owner: 10Ssingh) [16:28:01] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:32:14] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9945806 (10CDanis) Ah okay sorry. Maybe experiment with running `rados bench` and slowly increasing the number of n... [16:34:49] (03CR) 10Vgutierrez: [C:04-1] varnish: Copy value of X-Wikimedia-Debug cookie to header (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1030591 (https://phabricator.wikimedia.org/T350094) (owner: 10Gergő Tisza) [16:38:28] (03PS4) 10Gergő Tisza: varnish: Copy value of X-Wikimedia-Debug cookie to header [puppet] - 10https://gerrit.wikimedia.org/r/1030591 (https://phabricator.wikimedia.org/T350094) [16:38:38] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 15), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9945824 (10xcollazo) @Scott_French : One odd thing I notice is that, even thou... 
[16:38:50] (03CR) 10Gergő Tisza: varnish: Copy value of X-Wikimedia-Debug cookie to header (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1030591 (https://phabricator.wikimedia.org/T350094) (owner: 10Gergő Tisza) [16:44:10] (03PS1) 10Jforrester: Update OOUI to v0.50.3 [vendor] (wmf/1.43.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1051416 [16:44:26] (03PS1) 10Jforrester: Update OOUI to v0.50.3 [core] (wmf/1.43.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1051417 (https://phabricator.wikimedia.org/T369010) [16:45:09] Okie-dokie, train-blocker ahoy. [16:45:14] jouncebot: nowandnext [16:45:14] For the next 0 hour(s) and 14 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240702T1600) [16:45:14] In 0 hour(s) and 14 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240702T1700) [16:45:26] Hmm. Let's see if we can land this swiftly. [16:46:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1002 using scap backport" [vendor] (wmf/1.43.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1051416 (owner: 10Jforrester) [16:46:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1002 using scap backport" [core] (wmf/1.43.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1051417 (https://phabricator.wikimedia.org/T369010) (owner: 10Jforrester) [16:48:06] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9945840 (10dcaro) >>! In T348643#9945806, @CDanis wrote: > Ah okay sorry. Maybe experiment with running `rados benc... 
[17:00:00] (03PS1) 10DDesouza: miscweb(design-strategy): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051420 (https://phabricator.wikimedia.org/T344471) [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240702T1700) [17:00:55] (03CR) 10Hashar: [C:03+1] "Feel free to have this deployed at anytime ;)" [core] (wmf/1.43.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1051417 (https://phabricator.wikimedia.org/T369010) (owner: 10Jforrester) [17:01:08] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 15), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9945867 (10Scott_French) Thanks for taking a look, @xcollazo. I'll defer to @m... [17:01:33] (03CR) 10DDesouza: [C:03+2] miscweb(design-strategy): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051420 (https://phabricator.wikimedia.org/T344471) (owner: 10DDesouza) [17:02:11] 06SRE, 10Observability-Metrics: statsd-exporter in k8s is not configured to use its mapping configuration - https://phabricator.wikimedia.org/T369080 (10colewhite) 03NEW [17:02:28] 06SRE, 10Observability-Metrics: statsd-exporter in k8s is not configured to use its mapping configuration - https://phabricator.wikimedia.org/T369080#9945883 (10colewhite) p:05Triage→03High [17:02:33] (03PS11) 10Jdlrobson: [July 8th] Reduce list of exclusions for dark mode (1.43.0-wmf.12) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050671 (https://phabricator.wikimedia.org/T366366) [17:02:39] (03Merged) 10jenkins-bot: miscweb(design-strategy): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051420 (https://phabricator.wikimedia.org/T344471) (owner: 10DDesouza) [17:06:25] !log lists1004 - sudo systemctl start wmf_auto_restart_exim4 (T369017) [17:06:27] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:27] T369017: SystemdUnitFailed - lists1004 - wmf_auto_restart_exim4 - https://phabricator.wikimedia.org/T369017 [17:06:33] !log dani@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply [17:06:50] !log dani@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [17:06:51] !log dani@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [17:07:11] !log dani@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [17:07:12] !log dani@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply [17:07:39] !log dani@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [17:09:44] (03Merged) 10jenkins-bot: Update OOUI to v0.50.3 [vendor] (wmf/1.43.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1051416 (owner: 10Jforrester) [17:09:47] (03Merged) 10jenkins-bot: Update OOUI to v0.50.3 [core] (wmf/1.43.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1051417 (https://phabricator.wikimedia.org/T369010) (owner: 10Jforrester) [17:10:21] !log jforrester@deploy1002 Started scap sync-world: Backport for [[gerrit:1051416|Update OOUI to v0.50.3]], [[gerrit:1051417|Update OOUI to v0.50.3 (T369010)]] [17:10:24] T369010: Language dropdown on Special:NewItem is broken on Beta Wikidata - https://phabricator.wikimedia.org/T369010 [17:11:51] (03CR) 10Vgutierrez: [C:04-1] varnish: Copy value of X-Wikimedia-Debug cookie to header (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1030591 (https://phabricator.wikimedia.org/T350094) (owner: 10Gergő Tisza) [17:14:37] !log jforrester@deploy1002 jforrester: Backport for [[gerrit:1051416|Update OOUI to v0.50.3]], [[gerrit:1051417|Update OOUI to v0.50.3 (T369010)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [17:15:13] !log jforrester@deploy1002 jforrester: Continuing with sync [17:17:49] (03PS1) 10Cwhite: admin: add new ssh key for cwhite [puppet] - 
10https://gerrit.wikimedia.org/r/1051421 [17:19:02] (03PS1) 10Cwhite: admin: remove old ssh key for cwhite [puppet] - 10https://gerrit.wikimedia.org/r/1051422 [17:20:27] !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:1051416|Update OOUI to v0.50.3]], [[gerrit:1051417|Update OOUI to v0.50.3 (T369010)]] (duration: 10m 06s) [17:20:30] T369010: Language dropdown on Special:NewItem is broken on Beta Wikidata - https://phabricator.wikimedia.org/T369010 [17:22:42] (03CR) 10Bking: "I think it's just hitting timeouts. If you go back in Grafana (https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=p" [puppet] - 10https://gerrit.wikimedia.org/r/1051369 (https://phabricator.wikimedia.org/T366405) (owner: 10Bking) [17:24:26] (03CR) 10Bking: dse-k8s-services: Add net-new chart for Airflow (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking) [17:34:33] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [17:34:38] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply [17:36:19] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [17:36:21] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply [17:38:12] (03PS60) 10Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) [17:39:21] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [17:39:23] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply [17:40:25] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [17:40:42] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply 
[17:42:32] (03PS1) 10Cwhite: logstash: add curator delete job for ecs-k8s indices [puppet] - 10https://gerrit.wikimedia.org/r/1051427 (https://phabricator.wikimedia.org/T368186) [17:51:56] (03CR) 10Herron: [C:03+1] logstash: route thumbor logs in routing filter [puppet] - 10https://gerrit.wikimedia.org/r/1051214 (https://phabricator.wikimedia.org/T368180) (owner: 10Cwhite) [17:52:18] (03CR) 10Dzahn: "For now just let me add this: I can help with solving the "route alerts per team". It's possible. We have done this for gerrit checks by c" [puppet] - 10https://gerrit.wikimedia.org/r/1051369 (https://phabricator.wikimedia.org/T366405) (owner: 10Bking) [17:52:47] (03CR) 10Herron: [C:03+1] logstash: add curator delete job for ecs-k8s indices [puppet] - 10https://gerrit.wikimedia.org/r/1051427 (https://phabricator.wikimedia.org/T368186) (owner: 10Cwhite) [17:57:18] 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data Products (Data Products Sprint 15), and 2 others: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#9946355 (10xcollazo) >>! In T368098#9944101, @ABran-WMF wrote: > [[ https://wm... 
[17:59:43] (03CR) 10Krinkle: varnish: Copy value of X-Wikimedia-Debug cookie to header (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1030591 (https://phabricator.wikimedia.org/T350094) (owner: 10Gergő Tisza) [18:00:05] hashar and jeena: Deploy window MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240702T1800) [18:08:30] FIRING: ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2007:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:10:50] (03PS1) 10Giuseppe Lavagetto: base/statsd: add 1.0.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051428 [18:10:50] (03PS1) 10Giuseppe Lavagetto: statsd: re-add default args [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051429 [18:10:51] (03PS1) 10Giuseppe Lavagetto: statsd-exporter: update modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051430 [18:11:22] (03CR) 10Jdlrobson: [C:04-1] "Note: we want to re-evaluate tier 1 and 2 before deploying this."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050083 (https://phabricator.wikimedia.org/T367150) (owner: 10Jdlrobson) [18:13:30] RESOLVED: ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2007:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:15:34] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, July 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1048138 (https://phabricator.wikimedia.org/T343292) (owner: 10Arlolra) [18:15:57] 06SRE, 10Observability-Metrics: statsd-exporter in k8s is not configured to use its mapping configuration - https://phabricator.wikimedia.org/T369080#9946469 (10Joe) https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1051428 and followups should fix the issue [18:16:42] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, July 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1048144 (https://phabricator.wikimedia.org/T363720) (owner: 10Arlolra) [18:25:52] (03CR) 10Vgutierrez: [C:04-1] varnish: Copy value of X-Wikimedia-Debug cookie to header (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1030591 (https://phabricator.wikimedia.org/T350094) (owner: 10Gergő Tisza) [18:28:38] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Get test host connected to codfw row c/d lsw's - https://phabricator.wikimedia.org/T367512#9946538 (10cmooney) @Jhancock.wm can you confirm what position in the rack the server is in? I assumed based on the first port it's in U45 so I... 
[18:38:51] (03PS1) 10Ahmon Dancy: DevServices.php: Add excimer-ui-url/excimer-ui-server placeholders [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1051431 [18:41:13] (03CR) 10Ahmon Dancy: [C:03+2] DevServices.php: Add excimer-ui-url/excimer-ui-server placeholders [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1051431 (owner: 10Ahmon Dancy) [18:41:51] (03Merged) 10jenkins-bot: DevServices.php: Add excimer-ui-url/excimer-ui-server placeholders [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1051431 (owner: 10Ahmon Dancy) [18:49:04] (03PS1) 10JHathaway: postfix: add wikifunctions.org [puppet] - 10https://gerrit.wikimedia.org/r/1051432 (https://phabricator.wikimedia.org/T325406) [18:50:09] (03PS1) 10JHathaway: postfix: fix use param of $trusted_networks [puppet] - 10https://gerrit.wikimedia.org/r/1051433 (https://phabricator.wikimedia.org/T325406) [18:51:17] (03PS1) 10JHathaway: postfix: override default for parent_domain_matches_subdomains [puppet] - 10https://gerrit.wikimedia.org/r/1051434 (https://phabricator.wikimedia.org/T325406) [18:51:20] (03PS1) 10JHathaway: postfix: verify recipients when possible [puppet] - 10https://gerrit.wikimedia.org/r/1051435 (https://phabricator.wikimedia.org/T325406) [18:51:43] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1051432 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway) [18:52:33] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1051433 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway) [18:52:36] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1051434 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway) [18:52:38] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1051435 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway) [18:54:39] (03CR) 10JHathaway: [C:03+2] postfix: 
add wikifunctions.org [puppet] - 10https://gerrit.wikimedia.org/r/1051432 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway) [18:54:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T364069)', diff saved to https://phabricator.wikimedia.org/P65671 and previous config saved to /var/cache/conftool/dbconfig/20240702-185443-marostegui.json [18:54:46] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [18:55:00] (03CR) 10JHathaway: [C:03+2] postfix: fix use param of $trusted_networks [puppet] - 10https://gerrit.wikimedia.org/r/1051433 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway) [19:07:17] (03CR) 10JHathaway: [C:03+2] postfix: override default for parent_domain_matches_subdomains [puppet] - 10https://gerrit.wikimedia.org/r/1051434 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway) [19:07:24] (03CR) 10JHathaway: [C:03+2] postfix: verify recipients when possible [puppet] - 10https://gerrit.wikimedia.org/r/1051435 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway) [19:08:01] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/1051439 [19:09:02] 06SRE, 06Infrastructure-Foundations, 10netops: Add new elements to automation to support new switches in codfw C/D - https://phabricator.wikimedia.org/T369106 (10cmooney) 03NEW p:05Triage→03Medium [19:09:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P65672 and previous config saved to /var/cache/conftool/dbconfig/20240702-190950-marostegui.json [19:10:45] 06SRE, 06Infrastructure-Foundations, 10netops: Add new elements to automation to support new switches in codfw C/D - https://phabricator.wikimedia.org/T369106#9946735 (10cmooney) [19:10:46] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Codfw row C/D switch installation & configuration - 
https://phabricator.wikimedia.org/T364095#9946736 (10cmooney) [19:10:51] (03PS1) 10Cathal Mooney: Add net data for codfw new per-rack subnets and add switches to rancid [puppet] - 10https://gerrit.wikimedia.org/r/1051440 (https://phabricator.wikimedia.org/T369106) [19:16:27] (03PS2) 10Cathal Mooney: Add net data for codfw new per-rack subnets and add switches to rancid [puppet] - 10https://gerrit.wikimedia.org/r/1051440 (https://phabricator.wikimedia.org/T369106) [19:19:15] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [19:21:08] (03PS5) 10Gergő Tisza: varnish: Copy value of X-Wikimedia-Debug cookie to header [puppet] - 10https://gerrit.wikimedia.org/r/1030591 (https://phabricator.wikimedia.org/T350094) [19:24:32] (03PS6) 10Gergő Tisza: varnish: Copy value of X-Wikimedia-Debug cookie to header [puppet] - 10https://gerrit.wikimedia.org/r/1030591 (https://phabricator.wikimedia.org/T350094) [19:24:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P65673 and previous config saved to /var/cache/conftool/dbconfig/20240702-192457-marostegui.json [19:25:05] (03CR) 10Gergő Tisza: varnish: Copy value of X-Wikimedia-Debug cookie to header (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1030591 (https://phabricator.wikimedia.org/T350094) (owner: 10Gergő Tisza) [19:27:33] (03CR) 10Dduvall: [C:03+1] "Looks right!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1051208 (owner: 10Ahmon Dancy) [19:35:38] (03PS1) 10Andrew Bogott: profile::toolforge::elasticsearch::keepalived: keepalived interface from hiera [puppet] - 10https://gerrit.wikimedia.org/r/1051444 (https://phabricator.wikimedia.org/T311905) [19:36:47] (03PS1) 10JHathaway: temporarily add mx-in1001 as an MX server, test #2 [dns] - 10https://gerrit.wikimedia.org/r/1051445 (https://phabricator.wikimedia.org/T367517) [19:36:56] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 15), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9946877 (10mforns) @Scott_French Would it be possible for us to make a last ho... [19:37:37] (03CR) 10CI reject: [V:04-1] temporarily add mx-in1001 as an MX server, test #2 [dns] - 10https://gerrit.wikimedia.org/r/1051445 (https://phabricator.wikimedia.org/T367517) (owner: 10JHathaway) [19:40:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T364069)', diff saved to https://phabricator.wikimedia.org/P65674 and previous config saved to /var/cache/conftool/dbconfig/20240702-194005-marostegui.json [19:40:07] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2188.codfw.wmnet with reason: Maintenance [19:40:08] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [19:40:11] (03PS2) 10JHathaway: temporarily add mx-in1001 as an MX server, test #2 [dns] - 10https://gerrit.wikimedia.org/r/1051445 (https://phabricator.wikimedia.org/T367517) [19:40:21] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2188.codfw.wmnet with reason: Maintenance [19:40:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2188 (T364069)', diff saved to https://phabricator.wikimedia.org/P65675 and previous config saved to 
/var/cache/conftool/dbconfig/20240702-194027-marostegui.json [19:41:26] (03PS5) 10Herron: pyrra: add liftwing SLOs [puppet] - 10https://gerrit.wikimedia.org/r/1051439 (https://phabricator.wikimedia.org/T302995) [19:41:27] (03CR) 10Herron: [V:03+1] "Hey Luca, thinking about revisiting this to see how it performs now. What do you think?" [puppet] - 10https://gerrit.wikimedia.org/r/1051439 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [19:41:38] (03CR) 10Andrew Bogott: [C:03+2] profile::toolforge::elasticsearch::keepalived: keepalived interface from hiera [puppet] - 10https://gerrit.wikimedia.org/r/1051444 (https://phabricator.wikimedia.org/T311905) (owner: 10Andrew Bogott) [19:42:37] (03PS1) 10Ryan Kemper: [WIP] wdqs graph split: new A and DYNA records [dns] - 10https://gerrit.wikimedia.org/r/1051446 (https://phabricator.wikimedia.org/T364364) [19:43:49] (03CR) 10JHathaway: [C:03+2] temporarily add mx-in1001 as an MX server, test #2 [dns] - 10https://gerrit.wikimedia.org/r/1051445 (https://phabricator.wikimedia.org/T367517) (owner: 10JHathaway) [19:45:31] !log running another email inbound mx test on mx-in1001 [19:45:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:47] (03PS1) 10Btullis: cephcsi: bump the image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051447 (https://phabricator.wikimedia.org/T327259) [19:59:56] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 15), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9947026 (10Scott_French) @mforns sure, that's no problem at all! Just let me k... [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240702T2000). 
[20:00:04] kimberly_sarabia and arlolra: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:27] hello! [20:00:33] i can deploy today [20:00:35] hello kimberly_sarabia [20:01:09] arlolra: around? [20:01:36] (03CR) 10Urbanecm: [C:03+2] [July 2nd] Mobile: Enable dark mode for all users for tier 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050085 (https://phabricator.wikimedia.org/T367151) (owner: 10Jdlrobson) [20:02:13] (03Merged) 10jenkins-bot: [July 2nd] Mobile: Enable dark mode for all users for tier 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050085 (https://phabricator.wikimedia.org/T367151) (owner: 10Jdlrobson) [20:02:35] urbanecm: yes, around [20:03:01] (03PS2) 10Arlolra: Remove unused Linter configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1048138 (https://phabricator.wikimedia.org/T343292) [20:03:04] (03CR) 10Urbanecm: [C:03+2] Remove unused Linter configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1048138 (https://phabricator.wikimedia.org/T343292) (owner: 10Arlolra) [20:03:06] yay! 
:) [20:03:43] (03Merged) 10jenkins-bot: Remove unused Linter configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1048138 (https://phabricator.wikimedia.org/T343292) (owner: 10Arlolra) [20:04:16] (03CR) 10Btullis: [C:03+2] cephcsi: bump the image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051447 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [20:04:34] !log urbanecm@deploy1002 Started scap sync-world: Backport for [[gerrit:1050085|[July 2nd] Mobile: Enable dark mode for all users for tier 1 wikis (T367151)]], [[gerrit:1048138|Remove unused Linter configs (T343292)]] [20:04:39] T367151: [Config] Deploy dark mode to all users in tier 1 wikis on the Minerva skin - https://phabricator.wikimedia.org/T367151 [20:04:41] T343292: Deprecate and then remove Linter config variables used to control new linter table field access - https://phabricator.wikimedia.org/T343292 [20:07:19] (03Merged) 10jenkins-bot: cephcsi: bump the image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051447 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [20:07:21] !log urbanecm@deploy1002 jdlrobson, arlolra, urbanecm: Backport for [[gerrit:1050085|[July 2nd] Mobile: Enable dark mode for all users for tier 1 wikis (T367151)]], [[gerrit:1048138|Remove unused Linter configs (T343292)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:07:29] kimberly_sarabia: please test at mwdebug [20:07:41] arlolra: your first patch is at debug as well, but looks like there might not be anything to test? [20:07:55] ok i need a couple minutes.
have to look at several wikis [20:09:13] sure [20:09:15] urbanecm: thanks, I'll just verify linting is still working [20:09:22] arlolra: ack, will wait on you [20:11:21] (03PS1) 10CDanis: copy patch [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051449 (https://phabricator.wikimedia.org/T363407) [20:11:23] (03PS1) 10CDanis: mesh: use namespace for default service name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051450 (https://phabricator.wikimedia.org/T363407) [20:11:48] (03CR) 10Cathal Mooney: [C:03+2] Add net data for codfw new per-rack subnets and add switches to rancid [puppet] - 10https://gerrit.wikimedia.org/r/1051440 (https://phabricator.wikimedia.org/T369106) (owner: 10Cathal Mooney) [20:12:18] urbanecm: LGTM! [20:12:22] (03PS2) 10CDanis: mesh: use namespace for default service name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051450 (https://phabricator.wikimedia.org/T363407) [20:12:23] ack, ty! [20:12:25] waiting on arlolra [20:14:28] Please go ahead [20:15:22] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [20:15:46] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[20:15:59] proceeding [20:16:01] !log urbanecm@deploy1002 jdlrobson, arlolra, urbanecm: Continuing with sync [20:16:06] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Add new elements to automation to support new switches in codfw C/D - https://phabricator.wikimedia.org/T369106#9947129 (10cmooney) [20:16:36] (03PS2) 10Arlolra: Follow the defaults for Parsoid on MFE on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1048144 (https://phabricator.wikimedia.org/T363720) [20:16:40] (03CR) 10Urbanecm: [C:03+2] Follow the defaults for Parsoid on MFE on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1048144 (https://phabricator.wikimedia.org/T363720) (owner: 10Arlolra) [20:17:14] (03Merged) 10jenkins-bot: Follow the defaults for Parsoid on MFE on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1048144 (https://phabricator.wikimedia.org/T363720) (owner: 10Arlolra) [20:21:06] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1050085|[July 2nd] Mobile: Enable dark mode for all users for tier 1 wikis (T367151)]], [[gerrit:1048138|Remove unused Linter configs (T343292)]] (duration: 16m 31s) [20:21:11] T367151: [Config] Deploy dark mode to all users in tier 1 wikis on the Minerva skin - https://phabricator.wikimedia.org/T367151 [20:21:12] T343292: Deprecate and then remove Linter config variables used to control new linter table field access - https://phabricator.wikimedia.org/T343292 [20:21:20] first one done [20:21:47] !log urbanecm@deploy1002 Started scap sync-world: Backport for [[gerrit:1048144|Follow the defaults for Parsoid on MFE on officewiki (T363720)]] [20:21:50] T363720: Provide ParserMigration option to exclude mobile frontend - https://phabricator.wikimedia.org/T363720 [20:24:34] !log urbanecm@deploy1002 arlolra, urbanecm: Backport for [[gerrit:1048144|Follow the defaults for Parsoid on MFE on officewiki (T363720)]] synced to the testservers 
(https://wikitech.wikimedia.org/wiki/Mwdebug) [20:25:07] !log cmooney@cumin1002 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host wikikube-ctrl2002.codfw.wmnet [20:25:52] arlolra: please take a look at the second patch please [20:26:00] Will do [20:27:48] Ok, working as expected [20:28:16] !log urbanecm@deploy1002 arlolra, urbanecm: Continuing with sync [20:28:19] proceeding, thanks [20:30:15] 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data Products (Data Products Sprint 15), and 2 others: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#9947164 (10xcollazo) `20240701` run update: Most all wikis are now done with... [20:31:53] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [20:33:32] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1048144|Follow the defaults for Parsoid on MFE on officewiki (T363720)]] (duration: 11m 44s) [20:33:34] T363720: Provide ParserMigration option to exclude mobile frontend - https://phabricator.wikimedia.org/T363720 [20:33:44] arlolra: and, done [20:33:46] anything else? [20:33:54] All good. Thanks so much [20:33:58] any time! 
[20:34:18] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add sretest2002 entries - cmooney@cumin1002" [20:35:12] (03PS1) 10JHathaway: Revert "temporarily add mx-in1001 as an MX server, test #2" [dns] - 10https://gerrit.wikimedia.org/r/1051452 [20:35:20] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add sretest2002 entries - cmooney@cumin1002" [20:35:20] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:36:50] (03CR) 10JHathaway: [C:03+2] Revert "temporarily add mx-in1001 as an MX server, test #2" [dns] - 10https://gerrit.wikimedia.org/r/1051452 (owner: 10JHathaway) [20:37:56] (03PS1) 10CDanis: Bump mediawiki chart version & mesh version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051453 (https://phabricator.wikimedia.org/T363407) [20:39:09] !log cmooney@cumin1002 START - Cookbook sre.hosts.provision for host sretest2002.mgmt.codfw.wmnet with reboot policy FORCED [20:39:40] (03PS1) 10Ahmon Dancy: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1051454 [20:40:38] thanks urbanecm [20:40:45] (03CR) 10Ahmon Dancy: [C:03+2] Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1051454 (owner: 10Ahmon Dancy) [20:41:23] (03Merged) 10jenkins-bot: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1051454 (owner: 10Ahmon Dancy) [20:42:37] (03PS1) 10Ahmon Dancy: DevServices.php: Set ipoid placeholder [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1051456 [20:42:52] (03CR) 10Ahmon Dancy: [C:03+2] DevServices.php: Set ipoid placeholder [mediawiki-config] (train-dev) - 
10https://gerrit.wikimedia.org/r/1051456 (owner: 10Ahmon Dancy) [20:43:34] (03Merged) 10jenkins-bot: DevServices.php: Set ipoid placeholder [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1051456 (owner: 10Ahmon Dancy) [20:44:15] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [20:45:49] !log cmooney@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2002.mgmt.codfw.wmnet with reboot policy FORCED [20:45:52] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [20:48:48] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Get test host connected to codfw row c/d lsw's - https://phabricator.wikimedia.org/T367512#9947208 (10cmooney) Also @Jhancock.wm when next on site can you check the mgmt / idrac connection for this one? It doesn't seem to be trying to... [20:49:27] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add sretest2002 entries - cmooney@cumin1002" [20:50:15] PROBLEM - Postfix SMTP on mx-in1001 is CRITICAL: connect to address 208.80.155.102 and port 25: Connection refused https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting [20:50:29] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add sretest2002 entries - cmooney@cumin1002" [20:50:29] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:52:11] (03PS6) 10Ayounsi: DHCP: send subnet-mask 255.255.255.255 for routed ganeti VMs [puppet] - 10https://gerrit.wikimedia.org/r/1051366 (https://phabricator.wikimedia.org/T300152) [20:52:11] (03PS1) 10Ayounsi: Routed Ganeti: add public v4 tap_ip [puppet] - 10https://gerrit.wikimedia.org/r/1051458 (https://phabricator.wikimedia.org/T362330) 
[20:52:26] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1051458 (https://phabricator.wikimedia.org/T362330) (owner: 10Ayounsi) [20:53:15] RECOVERY - Postfix SMTP on mx-in1001 is OK: OK - Certificate mx-in1001.wikimedia.org will expire on Wed 11 Sep 2024 07:47:45 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting [20:59:02] (03PS2) 10RLazarus: mediawiki: Allow setting `tty` and `stdin` for Jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051196 (https://phabricator.wikimedia.org/T368966) [20:59:20] (03PS4) 10RLazarus: deployment_server: Add --follow, --attach to mwscript_k8s [puppet] - 10https://gerrit.wikimedia.org/r/1034633 (https://phabricator.wikimedia.org/T368966) [21:00:39] 06SRE, 06Infrastructure-Foundations, 10Mail: Postfix inbound rollout sequence, mx-in - https://phabricator.wikimedia.org/T367517#9947237 (10jhathaway) [21:01:13] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail: Postfix outbound rollout sequence, mx-out - https://phabricator.wikimedia.org/T365395#9947239 (10jhathaway) [21:01:24] (03CR) 10RLazarus: [C:03+2] mediawiki: Allow setting `tty` and `stdin` for Jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051196 (https://phabricator.wikimedia.org/T368966) (owner: 10RLazarus) [21:03:11] (03Merged) 10jenkins-bot: mediawiki: Allow setting `tty` and `stdin` for Jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051196 (https://phabricator.wikimedia.org/T368966) (owner: 10RLazarus) [21:03:42] (03CR) 10RLazarus: [C:03+2] "Thanks for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/1034633 (https://phabricator.wikimedia.org/T368966) (owner: 10RLazarus) [21:06:44] 06SRE, 06collaboration-services, 06DBA, 13Patch-For-Review: Update grants for mailman - https://phabricator.wikimedia.org/T367833#9947250 (10eoghan) a:05eoghan→03Ladsgroup Spoken with @Ladsgroup , I think there's nothing immediate for sre-collab to do here so reassigning. 
Feel free to send it back to m... [21:10:19] (03CR) 10RLazarus: [C:03+2] base/statsd: add 1.0.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051428 (owner: 10Giuseppe Lavagetto) [21:11:12] (03Merged) 10jenkins-bot: base/statsd: add 1.0.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051428 (owner: 10Giuseppe Lavagetto) [21:11:40] (03PS2) 10RLazarus: statsd: re-add default args [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051429 (https://phabricator.wikimedia.org/T369080) (owner: 10Giuseppe Lavagetto) [21:11:48] (03CR) 10CI reject: [V:04-1] statsd: re-add default args [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051429 (https://phabricator.wikimedia.org/T369080) (owner: 10Giuseppe Lavagetto) [21:12:45] (03PS3) 10RLazarus: statsd: re-add default args [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051429 (https://phabricator.wikimedia.org/T369080) (owner: 10Giuseppe Lavagetto) [21:15:17] (03CR) 10RLazarus: [C:03+2] statsd: re-add default args [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051429 (https://phabricator.wikimedia.org/T369080) (owner: 10Giuseppe Lavagetto) [21:16:06] (03Merged) 10jenkins-bot: statsd: re-add default args [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051429 (https://phabricator.wikimedia.org/T369080) (owner: 10Giuseppe Lavagetto) [21:17:59] (03CR) 10RLazarus: [C:03+2] statsd-exporter: update modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051430 (owner: 10Giuseppe Lavagetto) [21:28:05] (03CR) 10RLazarus: statsd-exporter: update modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051430 (owner: 10Giuseppe Lavagetto) [21:28:09] (03CR) 10RLazarus: [C:03+2] statsd-exporter: update modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051430 (owner: 10Giuseppe Lavagetto) [21:30:40] 10ops-eqiad, 06SRE, 06DC-Ops: PowerSupplyFailure - db1181 - https://phabricator.wikimedia.org/T368648#9947337 (10Jhancock.wm) a:03VRiley-WMF [21:30:47] (03PS2) 
10CDanis: Bump mediawiki chart version & mesh version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051453 (https://phabricator.wikimedia.org/T363407) [21:30:47] (03PS1) 10CDanis: DO NOT SUBMIT, testing mesh change against mediawiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051466 [21:31:34] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: hw troubleshooting: Faulty 40GBase-LR4 link from cloudsw1-d5-eqiad to cloudsw1-f4-eqiad - https://phabricator.wikimedia.org/T367199#9947342 (10Jhancock.wm) a:03VRiley-WMF [21:35:53] PROBLEM - Disk space on restbase2023 is CRITICAL: DISK CRITICAL - free space: /srv/sdb4 113989 MB (6% inode=99%): /srv/sdc4 69102 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase2023&var-datasource=codfw+prometheus/ops [21:44:02] jouncebot: nowandnext [21:44:02] No deployments scheduled for the next 8 hour(s) and 15 minute(s) [21:44:02] In 8 hour(s) and 15 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240703T0600) [21:44:38] doing a quick helmfile-only MW deploy for T369080 [21:44:38] T369080: statsd-exporter in k8s is not configured to use its mapping configuration - https://phabricator.wikimedia.org/T369080 [21:51:04] !log rzl@deploy1002 Started scap sync-world: T369080 [21:51:07] T369080: statsd-exporter in k8s is not configured to use its mapping configuration - https://phabricator.wikimedia.org/T369080 [21:52:47] !log rzl@deploy1002 rzl: T369080 synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:54:11] !log rzl@deploy1002 rzl: Continuing with sync [21:54:43] !log rzl@deploy1002 Finished scap: T369080 (duration: 04m 13s) [21:55:59] ah, I missed the recent changes to the statsd-exporter deployment -- I see scap doesn't touch it, deploying it manually with helmfile now [21:56:22] just when I finally get used to 
"never run helmfile across all mw deployments, use scap instead" :) [21:57:14] (03PS2) 10Wargo: Namespace and import configuration for sysop_plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051467 [21:57:49] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [21:58:16] !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [21:58:17] !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [21:58:27] !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [22:01:39] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [22:01:55] !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [22:01:56] !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [22:02:09] !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [22:02:10] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [22:02:25] !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [22:02:26] !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [22:02:40] !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [22:02:42] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply [22:02:55] !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply [22:02:56] !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply [22:03:10] !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply [22:03:11] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply [22:03:14] !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply [22:03:15] !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply [22:03:17] !log rzl@deploy1002 helmfile [eqiad] 
DONE helmfile.d/services/mw-mcrouter: apply [22:03:18] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/mw-misc: apply [22:03:30] !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-misc: apply [22:03:31] !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-misc: apply [22:03:40] !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-misc: apply [22:03:41] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [22:03:43] !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [22:03:44] !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [22:03:46] !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [22:03:48] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [22:04:08] !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [22:04:09] !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [22:04:21] !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [22:04:24] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply [22:04:38] !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [22:04:39] !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [22:04:50] !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [22:04:51] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/mw-wikifunctions: apply [22:05:00] !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-wikifunctions: apply [22:05:01] !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-wikifunctions: apply [22:05:08] !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-wikifunctions: apply [22:05:36] (03PS12) 10Jdlrobson: [July 8th] Reduce 
list of exclusions for dark mode (1.43.0-wmf.12) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050671 (https://phabricator.wikimedia.org/T366366) [22:05:44] (03PS4) 10Jdlrobson: [July 15th] Deploy dark mode to all logged-in users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050082 (https://phabricator.wikimedia.org/T368795) [22:05:49] done deploying [22:08:41] 06SRE, 10Observability-Metrics: statsd-exporter in k8s is not configured to use its mapping configuration - https://phabricator.wikimedia.org/T369080#9947463 (10RLazarus) Disregard the above scap, I got too carried away with "never run helmfile across all mw deployments, use scap instead" but obviously that ru... [22:13:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T364069)', diff saved to https://phabricator.wikimedia.org/P65676 and previous config saved to /var/cache/conftool/dbconfig/20240702-221312-marostegui.json [22:13:20] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [22:21:06] (03PS2) 10Wargo: Set logo and favicon for sysop_plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051469 (https://phabricator.wikimedia.org/T368712) [22:25:21] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 03 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051469 (https://phabricator.wikimedia.org/T368712) (owner: 10Wargo) [22:25:45] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 03 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051467 (owner: 10Wargo) [22:28:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P65677 and previous config saved to 
/var/cache/conftool/dbconfig/20240702-222820-marostegui.json [22:43:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P65678 and previous config saved to /var/cache/conftool/dbconfig/20240702-224328-marostegui.json [22:58:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T364069)', diff saved to https://phabricator.wikimedia.org/P65679 and previous config saved to /var/cache/conftool/dbconfig/20240702-225835-marostegui.json [22:58:38] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2202.codfw.wmnet with reason: Maintenance [22:58:39] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [22:58:51] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2202.codfw.wmnet with reason: Maintenance [23:10:13] 06SRE, 10Observability-Metrics: statsd-exporter in k8s is not configured to use its mapping configuration - https://phabricator.wikimedia.org/T369080#9947536 (10colewhite) Thank you @RLazarus! @dcausse, I see some metrics now at `mediawiki_cirrus_search_request_time_bucket`. Anything amiss? 
[23:19:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222 (T367856)', diff saved to https://phabricator.wikimedia.org/P65680 and previous config saved to /var/cache/conftool/dbconfig/20240702-231945-marostegui.json [23:19:49] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [23:34:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P65681 and previous config saved to /var/cache/conftool/dbconfig/20240702-233452-marostegui.json [23:38:27] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1051486 [23:38:27] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1051486 (owner: 10TrainBranchBot) [23:50:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P65682 and previous config saved to /var/cache/conftool/dbconfig/20240702-234959-marostegui.json [23:51:45] (03CR) 10Eccenux: "Seems like it needs yaml update too" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051469 (https://phabricator.wikimedia.org/T368712) (owner: 10Wargo) [23:54:59] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1051427 (https://phabricator.wikimedia.org/T368186) (owner: 10Cwhite) [23:55:21] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1051214 (https://phabricator.wikimedia.org/T368180) (owner: 10Cwhite)