[00:06:25] RESOLVED: SystemdUnitFailed: dump_proxy_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:10:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T413525)', diff saved to https://phabricator.wikimedia.org/P87578 and previous config saved to /var/cache/conftool/dbconfig/20260116-001027-marostegui.json [00:10:33] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [00:14:11] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:14:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2236 (T413525)', diff saved to https://phabricator.wikimedia.org/P87579 and previous config saved to /var/cache/conftool/dbconfig/20260116-001449-marostegui.json [00:20:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P87580 and previous config saved to /var/cache/conftool/dbconfig/20260116-002036-marostegui.json [00:24:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P87581 and previous config saved to /var/cache/conftool/dbconfig/20260116-002457-marostegui.json [00:30:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P87582 and previous config saved to /var/cache/conftool/dbconfig/20260116-003044-marostegui.json [00:35:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling 
after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P87583 and previous config saved to /var/cache/conftool/dbconfig/20260116-003506-marostegui.json [00:40:45] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1227484 [00:40:45] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1227484 (owner: TrainBranchBot) [00:40:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T413525)', diff saved to https://phabricator.wikimedia.org/P87584 and previous config saved to /var/cache/conftool/dbconfig/20260116-004052-marostegui.json [00:40:58] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [00:41:10] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1238.eqiad.wmnet with reason: Maintenance [00:41:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1238 (T413525)', diff saved to https://phabricator.wikimedia.org/P87585 and previous config saved to /var/cache/conftool/dbconfig/20260116-004117-marostegui.json [00:45:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2236 (T413525)', diff saved to https://phabricator.wikimedia.org/P87586 and previous config saved to /var/cache/conftool/dbconfig/20260116-004514-marostegui.json [00:45:32] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance [00:45:40] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2237 (T413525)', diff saved to https://phabricator.wikimedia.org/P87587 and previous config saved to /var/cache/conftool/dbconfig/20260116-004540-marostegui.json [00:52:36] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest 
[core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1227484 (owner: TrainBranchBot) [01:00:58] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [01:10:50] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1227491 [01:10:50] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1227491 (owner: TrainBranchBot) [01:14:03] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 13m 05s) [01:24:11] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [01:27:05] (PS1) Seawolf35gerrit: enwikiquote: Add autopatroller protection option [mediawiki-config] - https://gerrit.wikimedia.org/r/1227493 (https://phabricator.wikimedia.org/T414711) [01:27:58] (CR) CI reject: [V:-1] enwikiquote: Add autopatroller protection option [mediawiki-config] - https://gerrit.wikimedia.org/r/1227493 (https://phabricator.wikimedia.org/T414711) (owner: Seawolf35gerrit) [01:31:01] (PS2) Seawolf35gerrit: enwikiquote: Add autopatroller protection option [mediawiki-config] - https://gerrit.wikimedia.org/r/1227493 (https://phabricator.wikimedia.org/T414711) [01:31:51] (CR) CI reject: [V:-1] enwikiquote: Add autopatroller protection option [mediawiki-config] - https://gerrit.wikimedia.org/r/1227493 (https://phabricator.wikimedia.org/T414711) (owner: Seawolf35gerrit) [01:32:53] (PS3) Seawolf35gerrit: enwikiquote: Add autopatroller protection option [mediawiki-config] - https://gerrit.wikimedia.org/r/1227493 (https://phabricator.wikimedia.org/T414711) [01:33:26] (Merged) jenkins-bot: Branch 
commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1227491 (owner: TrainBranchBot) [01:33:42] (CR) CI reject: [V:-1] enwikiquote: Add autopatroller protection option [mediawiki-config] - https://gerrit.wikimedia.org/r/1227493 (https://phabricator.wikimedia.org/T414711) (owner: Seawolf35gerrit) [01:36:10] (PS4) Seawolf35gerrit: enwikiquote: Add autopatroller protection option [mediawiki-config] - https://gerrit.wikimedia.org/r/1227493 (https://phabricator.wikimedia.org/T414711) [01:36:35] (CR) Seawolf35gerrit: "Maybe I've fixed all my bad typing now." [mediawiki-config] - https://gerrit.wikimedia.org/r/1227493 (https://phabricator.wikimedia.org/T414711) (owner: Seawolf35gerrit) [01:39:50] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 19 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1227493 (https://phabricator.wikimedia.org/T414711) (owner: Seawolf35gerrit) [01:41:04] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 19 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1227493 (https://phabricator.wikimedia.org/T414711) (owner: Seawolf35gerrit) [02:07:41] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1263 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87588 and previous config saved to /var/cache/conftool/dbconfig/20260116-020740-marostegui.json [02:07:47] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [02:07:47] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [02:17:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1263', diff saved to 
https://phabricator.wikimedia.org/P87589 and previous config saved to /var/cache/conftool/dbconfig/20260116-021748-marostegui.json [02:27:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1263', diff saved to https://phabricator.wikimedia.org/P87590 and previous config saved to /var/cache/conftool/dbconfig/20260116-022758-marostegui.json [02:38:06] FIRING: CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth1 (Subnet frack-fundraising-codfw in F5) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [02:38:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1263 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87591 and previous config saved to /var/cache/conftool/dbconfig/20260116-023806-marostegui.json [02:38:13] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [02:38:13] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [02:38:24] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [03:09:40] FIRING: [14x] SystemdUnitFailed: prometheus-node-textfile-check-nft.service on tcp-proxy1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:54:25] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.58.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [03:54:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [04:00:16] (CR) Codename Noreste: [C:-1] "It looks like you forgot to add the user right (editautopatrolprotected) to the bot user group." [mediawiki-config] - https://gerrit.wikimedia.org/r/1227493 (https://phabricator.wikimedia.org/T414711) (owner: Seawolf35gerrit) [04:13:20] (CR) Seawolf35gerrit: "@codenamenoreste@gmail.com Bots have this right by default. For example, https://phabricator.wikimedia.org/T357298 and https://gerrit.wiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/1227493 (https://phabricator.wikimedia.org/T414711) (owner: Seawolf35gerrit) [04:14:11] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:17:04] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:17:26] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:19:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.58.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [04:19:39] RESOLVED: [4x] CoreBGPDown: Core 
BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [04:30:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1238 (T413525)', diff saved to https://phabricator.wikimedia.org/P87592 and previous config saved to /var/cache/conftool/dbconfig/20260116-043012-marostegui.json [04:30:17] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [04:35:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2237 (T413525)', diff saved to https://phabricator.wikimedia.org/P87593 and previous config saved to /var/cache/conftool/dbconfig/20260116-043511-marostegui.json [04:40:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1238', diff saved to https://phabricator.wikimedia.org/P87594 and previous config saved to /var/cache/conftool/dbconfig/20260116-044020-marostegui.json [04:45:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P87595 and previous config saved to /var/cache/conftool/dbconfig/20260116-044519-marostegui.json [04:50:30] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1238', diff saved to https://phabricator.wikimedia.org/P87596 and previous config saved to /var/cache/conftool/dbconfig/20260116-045028-marostegui.json [04:55:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P87597 and previous config saved to /var/cache/conftool/dbconfig/20260116-045527-marostegui.json [05:00:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1238 (T413525)', diff saved to https://phabricator.wikimedia.org/P87598 and previous config saved 
to /var/cache/conftool/dbconfig/20260116-050038-marostegui.json [05:00:44] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [05:00:55] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1241.eqiad.wmnet with reason: Maintenance [05:01:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1241 (T413525)', diff saved to https://phabricator.wikimedia.org/P87599 and previous config saved to /var/cache/conftool/dbconfig/20260116-050102-marostegui.json [05:05:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2237 (T413525)', diff saved to https://phabricator.wikimedia.org/P87600 and previous config saved to /var/cache/conftool/dbconfig/20260116-050536-marostegui.json [05:05:52] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2239.codfw.wmnet with reason: Maintenance [05:09:11] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:24:11] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [05:34:11] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:48:24] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) 
for 3 days, 0:00:00 on db1169.eqiad.wmnet with reason: Maintenance [05:48:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1169 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87601 and previous config saved to /var/cache/conftool/dbconfig/20260116-054831-marostegui.json [05:48:39] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [05:48:39] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [06:16:10] !log marostegui@cumin1003 START - Cookbook sre.wikireplicas.update-views [06:21:03] !log marostegui@cumin1003 END (PASS) - Cookbook sre.wikireplicas.update-views (exit_code=0) [06:21:13] !log marostegui@cumin1003 START - Cookbook sre.wikireplicas.update-views [06:26:55] !log marostegui@cumin1003 END (PASS) - Cookbook sre.wikireplicas.update-views (exit_code=0) [06:38:06] FIRING: CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth1 (Subnet frack-fundraising-codfw in F5) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [06:56:46] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for johannesrichterwmde - https://phabricator.wikimedia.org/T414678#11527720 (Kris_Litson_WMDE) I also bless this request as the lead of @Johannes_Richter_WMDE [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260116T0700) [07:09:40] FIRING: [14x] SystemdUnitFailed: prometheus-node-textfile-check-nft.service on tcp-proxy1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:19:08] (CR) Muehlenhoff: "openjdk/jdk21 for Bookworm was just a one off to move CAS to Java 21 (when it started depending on it), we're currently not keeping it act" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1227376 (https://phabricator.wikimedia.org/T414695) (owner: Bking) [07:24:09] (PS3) Muehlenhoff: Remove profile::puppet::agent::force_puppet7 from traffic hosts [puppet] - https://gerrit.wikimedia.org/r/1225524 (https://phabricator.wikimedia.org/T365798) [07:26:50] (CR) Muehlenhoff: [C:+2] Remove profile::puppet::agent::force_puppet7 from traffic hosts [puppet] - https://gerrit.wikimedia.org/r/1225524 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff) [07:31:27] (PS1) Muehlenhoff: Remove remaining traces of profile::puppet::agent::force_puppet7 [puppet] - https://gerrit.wikimedia.org/r/1227616 (https://phabricator.wikimedia.org/T365798) [07:43:03] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [07:43:12] SRE, LDAP-Access-Requests: Grant Access to wmde and nda for Johannes Richter WMDE - https://phabricator.wikimedia.org/T404080#11527767 (Dzahn) please see T414678#11524961 [07:49:00] (PS1) Muehlenhoff: Rename enc_client and move under puppetserver [puppet] - https://gerrit.wikimedia.org/r/1227618 (https://phabricator.wikimedia.org/T365798) [07:50:33] (CR) Muehlenhoff: "Maintaining component/jdk21 for Bookworm is also an option, if e.g. OpenSearch isn't yet compatible with Trixie otherwise." 
[docker-images/production-images] - https://gerrit.wikimedia.org/r/1227376 (https://phabricator.wikimedia.org/T414695) (owner: Bking) [07:50:50] (PS2) Muehlenhoff: Rename enc_client and move under puppetserver [puppet] - https://gerrit.wikimedia.org/r/1227618 (https://phabricator.wikimedia.org/T365798) [07:52:48] (CR) CI reject: [V:-1] Rename enc_client and move under puppetserver [puppet] - https://gerrit.wikimedia.org/r/1227618 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff) [07:53:03] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [07:54:38] !log phabricator - addign Johannes_Richter_WMDE to WMF-NDA T414678 [07:54:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:44] T414678: Requesting access to analytics-privatedata-users for johannesrichterwmde - https://phabricator.wikimedia.org/T414678 [07:55:12] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for johannesrichterwmde - https://phabricator.wikimedia.org/T414678#11527773 (Dzahn) @Johannes_Richter_WMDE Yes, that is still common practice. It must have been overlooked back then in that other task (left a comment there). I co... 
[07:56:43] (CR) Dpogorzelski: [C:+2] Add vLLM image in ML namespace [docker-images/production-images] - https://gerrit.wikimedia.org/r/1146891 (https://phabricator.wikimedia.org/T385173) (owner: Kevin Bazira) [07:56:53] (CR) Dpogorzelski: [V:+2 C:+2] Add vLLM image in ML namespace [docker-images/production-images] - https://gerrit.wikimedia.org/r/1146891 (https://phabricator.wikimedia.org/T385173) (owner: Kevin Bazira) [07:58:11] !log phabricator - adding Martyn.ranyard to WMF-NDA (T413994) [07:58:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:16] T413994: Grant Access to wmde for martyn.ranyard - https://phabricator.wikimedia.org/T413994 [07:58:51] !log phabricator - adding kimpham to WMF-NDA (T414157) [07:58:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:56] T414157: Grant Access to wmde, nda for Kim.pham - https://phabricator.wikimedia.org/T414157 [08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260116T0800) [08:00:37] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for johannesrichterwmde - https://phabricator.wikimedia.org/T414678#11527777 (Dzahn) Also added recently NDAed WMDE users @kimpham and @Martyn.ranyard Sorry about missing this at first. 
[08:02:03] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [08:03:29] (PS3) Muehlenhoff: Rename enc_client and move under puppetserver [puppet] - https://gerrit.wikimedia.org/r/1227618 (https://phabricator.wikimedia.org/T365798) [08:04:25] FIRING: [14x] SystemdUnitFailed: prometheus-node-textfile-check-nft.service on tcp-proxy1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:05:57] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1227618 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff) [08:09:25] FIRING: [14x] SystemdUnitFailed: prometheus-node-textfile-check-nft.service on tcp-proxy1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:22:18] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1227352 (https://phabricator.wikimedia.org/T410314) (owner: Ayounsi) [08:23:04] SRE-swift-storage, Ceph, Data-Persistence, serviceops: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11527822 (elukey) @MatthewVernon Hi! I tried to manually delete some tests from the registry's bucket, both from eqiad (via s3cmd) and codfw (via the registry's G... 
[08:26:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:30:12] (PS1) Muehlenhoff: Move validatecloudvpsfqdn.py out of the puppetmaster module [puppet] - https://gerrit.wikimedia.org/r/1227694 (https://phabricator.wikimedia.org/T365798) [08:31:58] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2240.codfw.wmnet with reason: Maintenance [08:32:03] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [08:32:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2240 (T413525)', diff saved to https://phabricator.wikimedia.org/P87602 and previous config saved to /var/cache/conftool/dbconfig/20260116-083206-marostegui.json [08:32:12] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [08:32:58] SRE-Access-Requests: Update the list of "WMDE group" approvers on Wikitech - https://phabricator.wikimedia.org/T414775 (WMDE-leszek) NEW [08:32:59] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1227694 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff) [08:34:15] (PS1) Elukey: ml: fix vllm's image builder config [docker-images/production-images] - https://gerrit.wikimedia.org/r/1227697 (https://phabricator.wikimedia.org/T385173) [08:35:03] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - 
https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [08:36:02] (CR) JMeybohm: [C:+1] "nice, thanks" [puppet] - https://gerrit.wikimedia.org/r/978615 (owner: Muehlenhoff) [08:37:04] (CR) Kevin Bazira: [C:+1] ml: fix vllm's image builder config [docker-images/production-images] - https://gerrit.wikimedia.org/r/1227697 (https://phabricator.wikimedia.org/T385173) (owner: Elukey) [08:39:34] (CR) Kevin Bazira: Add vLLM image in ML namespace (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/1146891 (https://phabricator.wikimedia.org/T385173) (owner: Kevin Bazira) [08:43:23] (CR) Filippo Giunchedi: [C:+1] Move validatecloudvpsfqdn.py out of the puppetmaster module [puppet] - https://gerrit.wikimedia.org/r/1227694 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff) [08:44:05] (CR) Filippo Giunchedi: [C:+1] Rename enc_client and move under puppetserver [puppet] - https://gerrit.wikimedia.org/r/1227618 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff) [08:45:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T413525)', diff saved to https://phabricator.wikimedia.org/P87603 and previous config saved to /var/cache/conftool/dbconfig/20260116-084557-marostegui.json [08:46:03] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [08:46:17] (PS1) Muehlenhoff: Remove puppetmaster spec files [puppet] - https://gerrit.wikimedia.org/r/1227698 (https://phabricator.wikimedia.org/T365798) [08:49:37] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for johannesrichterwmde - https://phabricator.wikimedia.org/T414678#11527874 (Johannes_Richter_WMDE) Thanks! 
[08:50:01] !log depool titan2001, cleaning up block 01K88XDMJ9S0T2DR5K00VG9CFE (T410152) [08:50:03] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [08:50:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:07] T410152: Disk space saturation (/srv) on Titan hosts - https://phabricator.wikimedia.org/T410152 [08:53:02] (CR) Elukey: [C:+1] Remove remaining traces of profile::puppet::agent::force_puppet7 [puppet] - https://gerrit.wikimedia.org/r/1227616 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff) [08:55:42] (PS1) Muehlenhoff: Copy yamllint into the puppetserver module and use it [puppet] - https://gerrit.wikimedia.org/r/1227702 (https://phabricator.wikimedia.org/T365798) [08:56:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P87604 and previous config saved to /var/cache/conftool/dbconfig/20260116-085605-marostegui.json [08:58:01] (CR) CI reject: [V:-1] Copy yamllint into the puppetserver module and use it [puppet] - https://gerrit.wikimedia.org/r/1227702 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff) [08:58:26] SRE-swift-storage, Ceph, Data-Persistence, serviceops: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11527884 (MatthewVernon) ` mvernon@moss-be1001:~$ sudo cephadm shell -- radosgw-admin bucket sync status --bucket=registry-restricted Inferring fsid 3f38ada2-2d88... 
[09:01:35] (PS1) Elukey: docker_registry: allor to set the loglevel for an instance [puppet] - https://gerrit.wikimedia.org/r/1227705 (https://phabricator.wikimedia.org/T394476) [09:02:06] (CR) CI reject: [V:-1] docker_registry: allor to set the loglevel for an instance [puppet] - https://gerrit.wikimedia.org/r/1227705 (https://phabricator.wikimedia.org/T394476) (owner: Elukey) [09:02:20] (CR) Elukey: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1227705 (https://phabricator.wikimedia.org/T394476) (owner: Elukey) [09:03:54] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87605 and previous config saved to /var/cache/conftool/dbconfig/20260116-090353-marostegui.json [09:04:02] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [09:04:03] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [09:04:06] SRE, DNS, serviceops, Traffic, and 2 others: Set up DNS for abstract.wikipedia.org to be recognised - https://phabricator.wikimedia.org/T411724#11527892 (Dzahn) Yea, it is. Languages would typically be added to `dns/templates/helpers/langlist.tmpl` but it feels like adding a non-language to the "... 
[09:05:43] (PS1) Dzahn: add abstract.wikipedia.org to section for wikis not covered by langlist [dns] - https://gerrit.wikimedia.org/r/1227706 (https://phabricator.wikimedia.org/T411724) [09:06:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P87606 and previous config saved to /var/cache/conftool/dbconfig/20260116-090614-marostegui.json [09:07:08] (PS2) Muehlenhoff: Copy yamllint into the puppetserver module and use it [puppet] - https://gerrit.wikimedia.org/r/1227702 (https://phabricator.wikimedia.org/T365798) [09:07:21] SRE, DNS, serviceops, Traffic, and 3 others: Set up DNS for abstract.wikipedia.org to be recognised - https://phabricator.wikimedia.org/T411724#11527902 (Dzahn) I would think it belongs into the section for `Wikis with mobile site (alphabetic order), which are not covered by langlist.tmpl`. htt... [09:08:22] SRE, DNS, serviceops, Traffic, and 3 others: Set up DNS for abstract.wikipedia.org to be recognised - https://phabricator.wikimedia.org/T411724#11527904 (Dzahn) Note that there is also a section for ` Wikis without mobile site (alphabetic order), which are not covered by langlist.tmpl` right belo... 
[09:09:21] (PS2) Elukey: docker_registry: allor to set the loglevel for an instance [puppet] - https://gerrit.wikimedia.org/r/1227705 (https://phabricator.wikimedia.org/T394476) [09:10:01] (CR) Elukey: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1227705 (https://phabricator.wikimedia.org/T394476) (owner: Elukey) [09:13:17] (PS2) Dzahn: add abstract.wikipedia.org to section for wikis not covered by langlist [dns] - https://gerrit.wikimedia.org/r/1227706 (https://phabricator.wikimedia.org/T411724) [09:14:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P87607 and previous config saved to /var/cache/conftool/dbconfig/20260116-091401-marostegui.json [09:15:12] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1227702 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff) [09:15:32] !log attempting soft reboot of instance codesearch9.codesearch - down and can't connect - T414776 [09:15:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:37] T414776: Codesearch is down/unreachable (2026-01-16) - https://phabricator.wikimedia.org/T414776 [09:16:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T413525)', diff saved to https://phabricator.wikimedia.org/P87608 and previous config saved to /var/cache/conftool/dbconfig/20260116-091623-marostegui.json [09:16:31] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [09:16:41] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1242.eqiad.wmnet with reason: Maintenance [09:16:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1242 (T413525)', diff saved to https://phabricator.wikimedia.org/P87609 and previous config saved to
/var/cache/conftool/dbconfig/20260116-091649-marostegui.json [09:22:33] SRE, LDAP-Access-Requests, Phabricator: undisable vanderwaalforces in phabricator and ldap - https://phabricator.wikimedia.org/T414774#11527922 (taavi) I have already confirmed this with T&S based on a private email request. I still need to double-check how to invalidate the existing password to forc... [09:24:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P87611 and previous config saved to /var/cache/conftool/dbconfig/20260116-092410-marostegui.json [09:24:11] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:25:10] SRE-Access-Requests: Update the list of "WMDE group" approvers on Wikitech - https://phabricator.wikimedia.org/T414775#11527926 (Dzahn) Done! Confirmed both are in our NDA spreadsheet and the Phab WMF-NDA group. Added them to the Wikitech page. [09:25:44] SRE-Access-Requests: Update the list of "WMDE group" approvers on Wikitech - https://phabricator.wikimedia.org/T414775#11527927 (Dzahn) Open→Resolved a:Dzahn [09:29:00] SRE-swift-storage, Ceph, Data-Persistence, serviceops, Patch-For-Review: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11527941 (elukey) @MatthewVernon ah nice I used the wrong bucket name when checking the config, I still don't explain that error on apus-fe...
[09:29:43] SRE, SRE-Access-Requests, Data-Platform-SRE: Requesting deployment access for AKhatun - https://phabricator.wikimedia.org/T414347#11527944 (Dzahn) a:thcipriani→None [09:30:04] SRE, SRE-Access-Requests, Data-Engineering, Patch-For-Review: Requesting access to analytics-privatedata-users for kareid - https://phabricator.wikimedia.org/T413364#11527946 (Dzahn) a:thcipriani→None [09:31:30] PROBLEM - SSH on stat1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:32:20] RECOVERY - SSH on stat1010 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:32:33] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2240 (T413525)', diff saved to https://phabricator.wikimedia.org/P87612 and previous config saved to /var/cache/conftool/dbconfig/20260116-093232-marostegui.json [09:32:39] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [09:34:11] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:34:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87613 and previous config saved to /var/cache/conftool/dbconfig/20260116-093418-marostegui.json [09:34:27] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [09:34:28] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [09:34:36] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2146.codfw.wmnet
with reason: Maintenance [09:34:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2146 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87614 and previous config saved to /var/cache/conftool/dbconfig/20260116-093444-marostegui.json [09:35:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:36:13] (PS1) Dzahn: admin: add Aisha Khatun to deployers [puppet] - https://gerrit.wikimedia.org/r/1227718 (https://phabricator.wikimedia.org/T414347) [09:38:08] (CR) Ayounsi: "Adding a few reviewers to hopefully unblock it until we increase transport capacity." [puppet] - https://gerrit.wikimedia.org/r/1218784 (https://phabricator.wikimedia.org/T411617) (owner: Cathal Mooney) [09:40:02] (PS15) Federico Ceratto: sre.mysql.newpool: [de]pool various section kinds [cookbooks] - https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) [09:40:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:42:41] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2240', diff saved to https://phabricator.wikimedia.org/P87615 and previous config saved to /var/cache/conftool/dbconfig/20260116-094240-marostegui.json [09:44:47] SRE, SRE-Access-Requests: Requesting access to L3 data access for kimpham (developer name Kim.pham) - https://phabricator.wikimedia.org/T414660#11527995 (FCeratto-WMF) [09:48:18] SRE, SRE-Access-Requests: Requesting
access to L3 data access for kimpham (developer name Kim.pham) - https://phabricator.wikimedia.org/T414660#11528001 (FCeratto-WMF) [09:52:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2240', diff saved to https://phabricator.wikimedia.org/P87616 and previous config saved to /var/cache/conftool/dbconfig/20260116-095248-marostegui.json [09:53:19] SRE, SRE-Access-Requests: Requesting access to L3 data access for kimpham (developer name Kim.pham) - https://phabricator.wikimedia.org/T414660#11528013 (FCeratto-WMF) Pending out of band SSH verification. [09:54:25] SRE, SRE-Access-Requests: Requesting access to SRE/production access for Kim.pham (kimpham in phab) - https://phabricator.wikimedia.org/T414671#11528020 (FCeratto-WMF) [09:54:28] (PS1) Gehel: wdqs: setup new test servers for Blazegraph alternatives [puppet] - https://gerrit.wikimedia.org/r/1227726 (https://phabricator.wikimedia.org/T412235) [09:55:00] SRE, SRE-Access-Requests: Requesting access to SRE/production access for Kim.pham (kimpham in phab) - https://phabricator.wikimedia.org/T414671#11528025 (FCeratto-WMF) [09:56:08] (PS2) Dzahn: zookeeper: add ssl.keyStore.passwordPath [puppet] - https://gerrit.wikimedia.org/r/1224908 (https://phabricator.wikimedia.org/T405119) [09:56:18] (CR) Btullis: [C:+1] wdqs: setup new test servers for Blazegraph alternatives [puppet] - https://gerrit.wikimedia.org/r/1227726 (https://phabricator.wikimedia.org/T412235) (owner: Gehel) [09:56:47] !log remove static routes for magru ranges on cr1-eqiad to revert load-balance of transport traffic T414473 (https://phabricator.wikimedia.org/P87617) [09:56:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:53] T414473: magru hosts (erroneously) reported down due to TTL exceeded - https://phabricator.wikimedia.org/T414473 [09:57:21] (PS3) Dzahn: zookeeper: add ssl.keyStore.passwordPath [puppet] -
https://gerrit.wikimedia.org/r/1224908 (https://phabricator.wikimedia.org/T405119) [09:57:40] (PS4) Dzahn: zookeeper: add ssl.keyStore.passwordPath [puppet] - https://gerrit.wikimedia.org/r/1224908 (https://phabricator.wikimedia.org/T405119) [10:02:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2240 (T413525)', diff saved to https://phabricator.wikimedia.org/P87618 and previous config saved to /var/cache/conftool/dbconfig/20260116-100257-marostegui.json [10:03:04] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [10:03:15] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2245.codfw.wmnet with reason: Maintenance [10:03:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2245 (T413525)', diff saved to https://phabricator.wikimedia.org/P87619 and previous config saved to /var/cache/conftool/dbconfig/20260116-100322-marostegui.json [10:04:40] (PS1) Gehel: wdqs: cleanup site.pp entries for WDQS to make it more readable [puppet] - https://gerrit.wikimedia.org/r/1227728 [10:04:51] (CR) Dzahn: [C:-1] "actually.. we want to set the path to a file containing the password, not the password itself and also not mix zuul and zookeper lookups ."
[puppet] - https://gerrit.wikimedia.org/r/1224908 (https://phabricator.wikimedia.org/T405119) (owner: Dzahn) [10:07:15] (PS2) Gehel: wdqs: cleanup site.pp entries for WDQS to make it more readable [puppet] - https://gerrit.wikimedia.org/r/1227728 [10:07:50] (CR) Gehel: [C:+2] wdqs: setup new test servers for Blazegraph alternatives [puppet] - https://gerrit.wikimedia.org/r/1227726 (https://phabricator.wikimedia.org/T412235) (owner: Gehel) [10:12:14] (PS1) Brouberol: Define the airflow-sre public and internal domains [dns] - https://gerrit.wikimedia.org/r/1227731 (https://phabricator.wikimedia.org/T402512) [10:12:56] (PS1) Brouberol: Define the airflow-sre kubeconfig files [puppet] - https://gerrit.wikimedia.org/r/1227732 (https://phabricator.wikimedia.org/T402512) [10:12:59] (PS1) Brouberol: Setup the caching and ATS rules to publicly expose airflow-sre.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/1227733 (https://phabricator.wikimedia.org/T402512) [10:14:05] (PS1) Muehlenhoff: pcc_update_facts: Rename variables [puppet] - https://gerrit.wikimedia.org/r/1227734 (https://phabricator.wikimedia.org/T365798) [10:14:43] (CR) Elukey: [C:+1] "two nits but LGTM" [dns] - https://gerrit.wikimedia.org/r/1227731 (https://phabricator.wikimedia.org/T402512) (owner: Brouberol) [10:15:27] (CR) Elukey: [C:+1] Define the airflow-sre kubeconfig files [puppet] - https://gerrit.wikimedia.org/r/1227732 (https://phabricator.wikimedia.org/T402512) (owner: Brouberol) [10:15:50] (CR) Elukey: [C:+1] Setup the caching and ATS rules to publicly expose airflow-sre.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/1227733 (https://phabricator.wikimedia.org/T402512) (owner: Brouberol) [10:19:50] (PS1) Dzahn: zuul: write TLS passphrase to a file for zookeeper [puppet] - https://gerrit.wikimedia.org/r/1227735 (https://phabricator.wikimedia.org/T405119) [10:20:24] (CR) Vgutierrez: "please do not
merge till airflow-sre.discovery.wmnet is available" [puppet] - https://gerrit.wikimedia.org/r/1227733 (https://phabricator.wikimedia.org/T402512) (owner: Brouberol) [10:25:09] (PS1) Bartosz Wójtowicz: ml-services: Lower resource usage for article-descriptions on staging. [deployment-charts] - https://gerrit.wikimedia.org/r/1227736 (https://phabricator.wikimedia.org/T414431) [10:25:23] (CR) Elukey: [C:+1] "Also if possible let's rework the commit msg to something more meaningful. Maybe something like "trafficserver: setup caching and etc.."." [puppet] - https://gerrit.wikimedia.org/r/1227733 (https://phabricator.wikimedia.org/T402512) (owner: Brouberol) [10:26:08] (CR) Dzahn: [V:+1] "https://puppet-compiler.wmflabs.org/output/1227735/7903/zuul1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/1227735 (https://phabricator.wikimedia.org/T405119) (owner: Dzahn) [10:34:55] SRE, SRE-Access-Requests: Requesting access to SRE/production access for Kim.pham (kimpham in phab) - https://phabricator.wikimedia.org/T414671#11528171 (FCeratto-WMF) [10:35:04] (CR) Clément Goubert: wikikube: decommission worker[2052-2054,2063,2079-2084,2096-2101].codfw.wmnet (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1227431 (https://phabricator.wikimedia.org/T409103) (owner: Jasmine) [10:35:12] (CR) Clément Goubert: wikikube: decommission wikikube-worker[2116-2123,2216-2241].codfw.wmnet (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1227454 (https://phabricator.wikimedia.org/T409104) (owner: Jasmine) [10:35:17] (PS2) Brouberol: trafficserver: setup caching and ATS rules to publicly expose airflow-sre.w.o [puppet] - https://gerrit.wikimedia.org/r/1227733 (https://phabricator.wikimedia.org/T402512) [10:35:28] (CR) Brouberol: "Yep, as usual!"
[puppet] - https://gerrit.wikimedia.org/r/1227733 (https://phabricator.wikimedia.org/T402512) (owner: Brouberol) [10:35:44] SRE, SRE-Access-Requests: Requesting access to L3 data access for kimpham (developer name Kim.pham) - https://phabricator.wikimedia.org/T414660#11528172 (FCeratto-WMF) [10:36:45] SRE, SRE-Access-Requests: Yubikey-SSH-FIDO access for dduvall - https://phabricator.wikimedia.org/T414619#11528184 (MoritzMuehlenhoff) p:Triage→Medium [10:38:06] FIRING: CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth1 (Subnet frack-fundraising-codfw in F5) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [10:39:48] !log installing Linux 6.12.63 on trixie hosts [10:39:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:58] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026.01.05 - 2026.01.23): Degraded RAID on an-worker1200 - https://phabricator.wikimedia.org/T413360#11528193 (BTullis) I'm slightly confused by this, now. Has the drive swap already been done, @VRiley-WMF ? I'm checking the output from `sudo perccli64...
[10:42:22] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026.01.05 - 2026.01.23): Degraded RAID on an-worker1200 - https://phabricator.wikimedia.org/T413360#11528194 (BTullis) [10:42:49] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1227718 (https://phabricator.wikimedia.org/T414347) (owner: Dzahn) [10:45:02] (CR) Btullis: [C:+1] wdqs: cleanup site.pp entries for WDQS to make it more readable [puppet] - https://gerrit.wikimedia.org/r/1227728 (owner: Gehel) [10:48:23] (CR) Brouberol: [C:+2] Define the airflow-sre kubeconfig files [puppet] - https://gerrit.wikimedia.org/r/1227732 (https://phabricator.wikimedia.org/T402512) (owner: Brouberol) [10:55:00] (PS2) Trueg: blazegraph: alert on ratio of failed queries increase [alerts] - https://gerrit.wikimedia.org/r/1227364 (https://phabricator.wikimedia.org/T414306) [10:59:17] (CR) Trueg: "I lowered the threshold to `0.1` which might still be too high (especially considering that I changed the metric from `30m` to `5m` which " [alerts] - https://gerrit.wikimedia.org/r/1227364 (https://phabricator.wikimedia.org/T414306) (owner: Trueg) [11:06:37] PROBLEM - Host asw2-c-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [11:08:47] PROBLEM - Host asw2-d-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [11:10:51] (CR) Dzahn: [C:+2] admin: add Aisha Khatun to deployers [puppet] - https://gerrit.wikimedia.org/r/1227718 (https://phabricator.wikimedia.org/T414347) (owner: Dzahn) [11:13:08] (CR) Kosta Harlan: IPReputation: Define data provider, URL and developer mode config (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1223635 (https://phabricator.wikimedia.org/T410615) (owner: Kosta Harlan) [11:14:14] SRE, SRE-Access-Requests, Data-Platform-SRE, Patch-For-Review: Requesting deployment access for AKhatun - https://phabricator.wikimedia.org/T414347#11528241 (Dzahn) Hi @AKhatun_WMF you have been added to the
deployment group. Welcome to deployers! Access to deployment servers should work within ~... [11:15:07] (PS1) Dpogorzelski: ml-build: add missing configs [puppet] - https://gerrit.wikimedia.org/r/1227743 [11:15:18] SRE, SRE-Access-Requests, Data-Platform-SRE, Patch-For-Review: Requesting deployment access for AKhatun - https://phabricator.wikimedia.org/T414347#11528243 (Dzahn) [11:15:46] (PS4) Kosta Harlan: (WIP) IPReputation: Enable OpenSearch IPoid provider on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1223636 (https://phabricator.wikimedia.org/T410615) [11:15:53] (PS4) Kosta Harlan: IPReputation: Define data provider, URL and developer mode config [mediawiki-config] - https://gerrit.wikimedia.org/r/1223635 (https://phabricator.wikimedia.org/T410615) [11:15:53] (PS5) Kosta Harlan: (WIP) IPReputation: Enable OpenSearch IPoid provider on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1223636 (https://phabricator.wikimedia.org/T410615) [11:16:30] SRE, SRE-Access-Requests, Data-Platform-SRE, Patch-For-Review: Requesting deployment access for AKhatun - https://phabricator.wikimedia.org/T414347#11528244 (Dzahn) Closing this as resolved. For logstash access please see the comment from Tyler above. That's self-service via idm.wikimedia.org.... [11:16:39] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026.01.05 - 2026.01.23): hw troubleshooting: PERC1 battery failure for an-worker1148 - https://phabricator.wikimedia.org/T411919#11528245 (BTullis) In progress→Resolved This is back up and running with 12 data drives. ` btullis@an-worker1148...
[11:19:45] (CR) Elukey: ml-build: add missing configs (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1227743 (owner: Dpogorzelski) [11:20:09] (CR) Btullis: [C:+1] Define the airflow-sre public and internal domains [dns] - https://gerrit.wikimedia.org/r/1227731 (https://phabricator.wikimedia.org/T402512) (owner: Brouberol) [11:20:32] SRE, SRE-Access-Requests, Data-Platform-SRE, Patch-For-Review: Requesting deployment access for AKhatun - https://phabricator.wikimedia.org/T414347#11528266 (Dzahn) Open→Resolved a:Dzahn @AKhatun_WMF As the next step please also request the "spiderpig-access" group via IDM. See h... [11:20:39] (CR) Btullis: [C:+1] trafficserver: setup caching and ATS rules to publicly expose airflow-sre.w.o [puppet] - https://gerrit.wikimedia.org/r/1227733 (https://phabricator.wikimedia.org/T402512) (owner: Brouberol) [11:21:42] (PS2) Dpogorzelski: ml-build: add missing configs [puppet] - https://gerrit.wikimedia.org/r/1227743 [11:22:49] SRE, SRE-Access-Requests: Yubikey-SSH-FIDO for ryankemper - https://phabricator.wikimedia.org/T412126#11528272 (FCeratto-WMF) [Pinged RKemper on IRC] [11:23:32] (PS3) Dpogorzelski: ml-build: add missing configs [puppet] - https://gerrit.wikimedia.org/r/1227743 [11:23:40] (CR) Dpogorzelski: ml-build: add missing configs (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1227743 (owner: Dpogorzelski) [11:27:04] (CR) Elukey: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1227743 (owner: Dpogorzelski) [11:27:49] (PS1) Muehlenhoff: Remove profile::admin::groups from old mediawiki roles [puppet] - https://gerrit.wikimedia.org/r/1227744 [11:27:49] (PS1) Muehlenhoff: Remove mwdebuggers group [puppet] - https://gerrit.wikimedia.org/r/1227745 [11:28:26] (CR) Elukey: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1227743 (owner: Dpogorzelski) [11:29:16] ops-eqiad, SRE, DC-Ops:
Power Supply - PS1 Status - issue on clouddb1024:9290 - https://phabricator.wikimedia.org/T414681#11528290 (Jclark-ctr) Open→Resolved a:Jclark-ctr Replaced power cable [11:30:05] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations, netops: eqiad: rows C/D Upgrade Decom Asw Switches in Rows C & D - https://phabricator.wikimedia.org/T412525#11528293 (Jclark-ctr) @cmooney i have disconnected all the switches [11:30:37] (CR) Elukey: "Dawid for some reason the 'auto' selector seems to lead to `WARNING: no nodes found for class: Class/Profile::Docker::Ml_builder` and then" [puppet] - https://gerrit.wikimedia.org/r/1227743 (owner: Dpogorzelski) [11:34:15] (PS1) Kevin Bazira: ml-services: bump revertrisk CPU limit (ResourceQuota) for RR namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1227746 (https://phabricator.wikimedia.org/T414060) [11:44:42] (CR) Clément Goubert: [C:+1] Remove mwdebuggers group [puppet] - https://gerrit.wikimedia.org/r/1227745 (owner: Muehlenhoff) [11:44:59] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11528329 (Jclark-ctr) These will be arriving next week [11:45:14] (CR) Clément Goubert: [C:+1] Remove profile::admin::groups from old mediawiki roles [puppet] - https://gerrit.wikimedia.org/r/1227744 (owner: Muehlenhoff) [11:46:24] (PS2) Kevin Bazira: ml-services: bump CPU limit (ResourceQuota) for revertrisk namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1227746 (https://phabricator.wikimedia.org/T414060) [11:46:29] (PS1) Daphne Smit: [wikifunctions] Grant sysops permission to edit function of attached implementation and tester [mediawiki-config] - https://gerrit.wikimedia.org/r/1227748 (https://phabricator.wikimedia.org/T399934) [11:46:38] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mwlog1003.eqiad.wmnet with OS bookworm [11:46:55] ops-eqiad, SRE,
DC-Ops, SRE Observability (FY2025/2026-Q3): Q2:rack/setup/install mwlog1003 - https://phabricator.wikimedia.org/T412230#11528333 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host mwlog1003.eqiad.wmnet with OS bookworm executed with erro... [11:47:55] SRE, LDAP-Access-Requests, WMF-NDA-Requests: Grant Access to NDA for Johannnes89 - https://phabricator.wikimedia.org/T414789 (Johannnes89) NEW [11:49:20] (PS4) Dpogorzelski: ml-build: add missing configs [puppet] - https://gerrit.wikimedia.org/r/1227743 [11:50:11] (CR) Dpogorzelski: [C:+2] ml-services: bump CPU limit (ResourceQuota) for revertrisk namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1227746 (https://phabricator.wikimedia.org/T414060) (owner: Kevin Bazira) [11:53:48] SRE, Infrastructure-Foundations: Integrate Trixie 13.3 point update - https://phabricator.wikimedia.org/T414179#11528372 (MoritzMuehlenhoff) [11:57:21] (Merged) jenkins-bot: ml-services: bump CPU limit (ResourceQuota) for revertrisk namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1227746 (https://phabricator.wikimedia.org/T414060) (owner: Kevin Bazira) [11:57:33] ops-eqiad, SRE, DC-Ops, SRE Observability (FY2025/2026-Q3): Q2:rack/setup/install mwlog1003 - https://phabricator.wikimedia.org/T412230#11528390 (Jclark-ctr) {F71539174} Server is currently failing to image. I’ve reached out to Herron to review the RAID configuration in Puppet. [12:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260116T0800) [12:00:05] jelto, arnoldokoth, mutante, and arnaudb: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for GitLab version upgrades deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260116T1200).
[12:09:40] FIRING: [14x] SystemdUnitFailed: prometheus-node-textfile-check-nft.service on tcp-proxy1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:10:02] (PS1) Brouberol: Tweak druid configuration to enable druid 27.0 to run on jvm8 [puppet] - https://gerrit.wikimedia.org/r/1227754 (https://phabricator.wikimedia.org/T278056) [12:10:14] (PS2) Brouberol: Tweak druid configuration to enable druid 27.0 to run on jvm8 [puppet] - https://gerrit.wikimedia.org/r/1227754 (https://phabricator.wikimedia.org/T278056) [12:10:45] (CR) CI reject: [V:-1] Tweak druid configuration to enable druid 27.0 to run on jvm8 [puppet] - https://gerrit.wikimedia.org/r/1227754 (https://phabricator.wikimedia.org/T278056) (owner: Brouberol) [12:11:23] mutante: see the prometheus-node-textfile-check-nft.service alerts for tcp-proxy, these are also remnants of the former use of nftables on tcp-proxy* and will need manual cleanup [12:13:37] (PS3) Brouberol: Tweak druid configuration to enable druid 27.0 to run on jvm8 [puppet] - https://gerrit.wikimedia.org/r/1227754 (https://phabricator.wikimedia.org/T278056) [12:15:02] (CR) Joal: [C:+1] Tweak druid configuration to enable druid 27.0 to run on jvm8 [puppet] - https://gerrit.wikimedia.org/r/1227754 (https://phabricator.wikimedia.org/T278056) (owner: Brouberol) [12:17:00] (CR) Btullis: Tweak druid configuration to enable druid 27.0 to run on jvm8 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1227754 (https://phabricator.wikimedia.org/T278056) (owner: Brouberol) [12:17:44] (CR) Brouberol: Tweak druid configuration to enable druid 27.0 to run on jvm8 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1227754 (https://phabricator.wikimedia.org/T278056) (owner: Brouberol) [12:18:35] (PS4) Brouberol: Tweak druid configuration
to enable druid 27.0 to run on jvm8 [puppet] - https://gerrit.wikimedia.org/r/1227754 (https://phabricator.wikimedia.org/T278056) [12:18:39] (PS1) Elukey: role::cephadm::rgw: enable access logs for envoy [puppet] - https://gerrit.wikimedia.org/r/1227759 (https://phabricator.wikimedia.org/T394476) [12:18:47] (CR) Brouberol: Tweak druid configuration to enable druid 27.0 to run on jvm8 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1227754 (https://phabricator.wikimedia.org/T278056) (owner: Brouberol) [12:19:00] (CR) Elukey: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1227759 (https://phabricator.wikimedia.org/T394476) (owner: Elukey) [12:20:32] (CR) Cathal Mooney: [C:+1] "LGTM! Nit-in-my-head about having "v6_prefixes" with no netmask to show they are /64s. But it'd be too messy to include the prefixlen, a" [puppet] - https://gerrit.wikimedia.org/r/1227352 (https://phabricator.wikimedia.org/T410314) (owner: Ayounsi) [12:20:43] (PS2) Elukey: role::cephadm::rgw: enable access logs for envoy [puppet] - https://gerrit.wikimedia.org/r/1227759 (https://phabricator.wikimedia.org/T394476) [12:21:01] (CR) Elukey: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1227759 (https://phabricator.wikimedia.org/T394476) (owner: Elukey) [12:22:08] (PS3) Elukey: role::cephadm::rgw: enable access logs for envoy [puppet] - https://gerrit.wikimedia.org/r/1227759 (https://phabricator.wikimedia.org/T394476) [12:22:18] (CR) Elukey: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1227759 (https://phabricator.wikimedia.org/T394476) (owner: Elukey) [12:25:06] SRE-swift-storage, Ceph, Data-Persistence, serviceops, Patch-For-Review: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11528530 (elukey) To rule out any possible issues with the registry v 2.8, I build the 3.0 release, uploaded the binary and tested it on reg...
[12:25:54] (CR) Elukey: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1227743 (owner: Dpogorzelski) [12:26:40] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:31:34] (CR) Btullis: [C:+1] Tweak druid configuration to enable druid 27.0 to run on jvm8 [puppet] - https://gerrit.wikimedia.org/r/1227754 (https://phabricator.wikimedia.org/T278056) (owner: Brouberol) [12:32:17] (PS5) Brouberol: Tweak druid configuration to enable druid 27.0 to run on jvm8 [puppet] - https://gerrit.wikimedia.org/r/1227754 (https://phabricator.wikimedia.org/T278056) [12:32:51] (CR) Ayounsi: [C:+2] Routed ganeti: move v6_prefixes to Hiera [puppet] - https://gerrit.wikimedia.org/r/1227352 (https://phabricator.wikimedia.org/T410314) (owner: Ayounsi) [12:35:02] (CR) Brouberol: [C:+2] Tweak druid configuration to enable druid 27.0 to run on jvm8 [puppet] - https://gerrit.wikimedia.org/r/1227754 (https://phabricator.wikimedia.org/T278056) (owner: Brouberol) [12:54:58] (PS1) Jcrespo: install: Reimage with format backup1015-backup1020 & backup2015-backup2020 [puppet] - https://gerrit.wikimedia.org/r/1227773 (https://phabricator.wikimedia.org/T414727) [12:58:05] (CR) Jcrespo: [C:+2] install: Reimage with format backup1015-backup1020 & backup2015-backup2020 [puppet] - https://gerrit.wikimedia.org/r/1227773 (https://phabricator.wikimedia.org/T414727) (owner: Jcrespo) [12:58:41] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T413525)', diff saved to https://phabricator.wikimedia.org/P87620 and previous config saved to /var/cache/conftool/dbconfig/20260116-125840-marostegui.json [12:58:48] T413525: Add il_target_id to imagelinks table in wmf production -
https://phabricator.wikimedia.org/T413525 [13:01:20] SRE, Infrastructure-Foundations, netops, Data-Platform-SRE (2026.01.05 - 2026.01.23): Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11528619 (BTullis) I have also made the following ticket regarding upgrading he 1 Gbps network connections: {T414787} [13:04:30] (CR) JMeybohm: [C:-1] docker_registry: allor to set the loglevel for an instance (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1227705 (https://phabricator.wikimedia.org/T394476) (owner: Elukey) [13:05:31] (CR) JMeybohm: "recheck" [deployment-charts] - https://gerrit.wikimedia.org/r/1215098 (https://phabricator.wikimedia.org/T409528) (owner: Elukey) [13:06:09] (CR) JMeybohm: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1223649 (https://phabricator.wikimedia.org/T412805) (owner: JMeybohm) [13:08:49] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P87621 and previous config saved to /var/cache/conftool/dbconfig/20260116-130848-marostegui.json [13:10:28] ops-eqiad, DC-Ops: Upgrade any 1 Gbps dse-k8s-worker-eqiad network interfaces to 10 Gbps - https://phabricator.wikimedia.org/T414787#11528651 (BTullis) @wiki_willy @RobH @cmooney - Sorry to trouble you with this. Would such an upgrade for the network on these 9 hosts be possible, please? We have rename... [13:13:26] (CR) Ozge: [C:+1] "lgtm!" [deployment-charts] - https://gerrit.wikimedia.org/r/1227736 (https://phabricator.wikimedia.org/T414431) (owner: Bartosz Wójtowicz) [13:13:45] ops-eqiad, SRE, Data-Persistence, DC-Ops, Patch-For-Review: Q3:rack/setup/install backup1015 - https://phabricator.wikimedia.org/T414725#11528657 (jcrespo) [13:14:40] (CR) Marostegui: [C:+1] "Let's merge on Monday all the tests I did were ok. Let's not rename till the date we agreed on, but this can be merged."
[cookbooks] - 10https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) (owner: 10Federico Ceratto) [13:14:57] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install backup1015 - https://phabricator.wikimedia.org/T414725#11528658 (10jcrespo) This is almost done-ready to reimage (partman is ready), but I want to give a last review to the RAID controller setup and UEFI, to see i... [13:18:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P87622 and previous config saved to /var/cache/conftool/dbconfig/20260116-131857-marostegui.json [13:24:11] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:25:41] 06SRE, 06ServiceOps new, 13Patch-For-Review: Migrate ipblocks from fetch_external_clouds_vendors_nets.py to HIDDENPARMA - https://phabricator.wikimedia.org/T412805#11528682 (10MLechvien-WMF) [13:29:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T413525)', diff saved to https://phabricator.wikimedia.org/P87623 and previous config saved to /var/cache/conftool/dbconfig/20260116-132905-marostegui.json [13:29:11] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [13:29:15] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install backup10[16-20] - https://phabricator.wikimedia.org/T414728#11528688 (10jcrespo) [13:29:18] 10ops-eqiad, 06DC-Ops: Upgrade any 1 Gbps dse-k8s-worker-eqiad network interfaces to 10 Gbps - https://phabricator.wikimedia.org/T414787#11528691 (10cmooney) >>! 
In T414787#11528651, @BTullis wrote: > @wiki_willy @RobH @cmooney - Sorry to trouble you with this. > > Would such an upgrade for the network on the... [13:29:23] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1243.eqiad.wmnet with reason: Maintenance [13:29:31] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1243 (T413525)', diff saved to https://phabricator.wikimedia.org/P87624 and previous config saved to /var/cache/conftool/dbconfig/20260116-132930-marostegui.json [13:30:19] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install backup10[16-20] - https://phabricator.wikimedia.org/T414728#11528695 (10jcrespo) [13:31:26] (03CR) 10Federico Ceratto: [C:03+2] sre.mysql.newpool: [de]pool various section kinds (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) (owner: 10Federico Ceratto) [13:32:35] (03CR) 10Marostegui: [C:03+1] "I suggested Monday, but it is okay, nothing really uses it right now." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) (owner: 10Federico Ceratto) [13:33:00] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup20[16-20] - https://phabricator.wikimedia.org/T414727#11528708 (10jcrespo) [13:33:35] (03PS1) 10Brouberol: an-test-druid: disable noisy GC stat logging [puppet] - 10https://gerrit.wikimedia.org/r/1227786 (https://phabricator.wikimedia.org/T278056) [13:34:11] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:34:25] FIRING: [14x] SystemdUnitFailed: prometheus-node-textfile-check-nft.service on tcp-proxy1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:34:39] (03CR) 10Kamila Součková: wikikube: decommission worker[2052-2054,2063,2079-2084,2096-2101].codfw.wmnet (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1227431 (https://phabricator.wikimedia.org/T409103) (owner: 10Jasmine) [13:39:25] FIRING: [14x] SystemdUnitFailed: prometheus-node-textfile-check-nft.service on tcp-proxy1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:40:32] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup2015 - https://phabricator.wikimedia.org/T414724#11528732 (10jcrespo) [13:43:59] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Requesting deployment access for AKhatun - https://phabricator.wikimedia.org/T414347#11528742 (10Gehel) [13:45:27] 10ops-eqiad, 06SRE, 
06Data-Persistence, 06DC-Ops: Q3:rack/setup/install ms-backup100[34] - https://phabricator.wikimedia.org/T414718#11528755 (10jcrespo) [13:48:11] (03CR) 10Codename Noreste: enwikiquote: Add autopatroller protection option [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1227493 (https://phabricator.wikimedia.org/T414711) (owner: 10Seawolf35gerrit) [13:49:18] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for johannesrichterwmde - https://phabricator.wikimedia.org/T414678#11528789 (10FCeratto-WMF) [13:49:58] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install ms-backup200[34] - https://phabricator.wikimedia.org/T414717#11528801 (10jcrespo) [13:53:08] (03PS1) 10Jcrespo: backup: Setup ms-backup[12]00[34] [puppet] - 10https://gerrit.wikimedia.org/r/1227789 (https://phabricator.wikimedia.org/T414717) [13:53:38] (03CR) 10CI reject: [V:04-1] backup: Setup ms-backup[12]00[34] [puppet] - 10https://gerrit.wikimedia.org/r/1227789 (https://phabricator.wikimedia.org/T414717) (owner: 10Jcrespo) [13:58:17] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for johannesrichterwmde - https://phabricator.wikimedia.org/T414678#11528892 (10FCeratto-WMF) [13:59:40] 10ops-eqiad, 06SRE, 06DC-Ops: Upgrade any 1 Gbps dse-k8s-worker-eqiad network interfaces to 10 Gbps - https://phabricator.wikimedia.org/T414787#11528902 (10Jclark-ctr) @BTullis The servers racked in rows A and B are in 1G-only racks, so available space in those rows is limited. We recently completed a sel... 
[14:00:26] (03CR) 10Gehel: [C:03+2] wdqs: cleanup site.pp entries for WDQS to make it more readable [puppet] - 10https://gerrit.wikimedia.org/r/1227728 (owner: 10Gehel) [14:00:46] (03PS1) 10Jcrespo: backup: Set up backup1015-backup1020 & backup2015-backup2020 [puppet] - 10https://gerrit.wikimedia.org/r/1227792 (https://phabricator.wikimedia.org/T414728) [14:00:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2245 (T413525)', diff saved to https://phabricator.wikimedia.org/P87625 and previous config saved to /var/cache/conftool/dbconfig/20260116-140057-marostegui.json [14:01:03] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [14:01:30] (03CR) 10CI reject: [V:04-1] backup: Set up backup1015-backup1020 & backup2015-backup2020 [puppet] - 10https://gerrit.wikimedia.org/r/1227792 (https://phabricator.wikimedia.org/T414728) (owner: 10Jcrespo) [14:01:58] moritzm: yes, expect I already did that cleanup yesterday but for some reason the units came back. 
ack [14:06:13] (03PS1) 10Jgiannelos: mobileapps: Set limits on memory usage to avoid latency increase [deployment-charts] - 10https://gerrit.wikimedia.org/r/1227799 (https://phabricator.wikimedia.org/T410296) [14:08:36] (03PS1) 10Dpogorzelski: ml_builder: add missing prod_build_password [labs/private] - 10https://gerrit.wikimedia.org/r/1227800 [14:09:28] (03PS2) 10Jgiannelos: mobileapps: Set limits on memory usage to avoid latency increase [deployment-charts] - 10https://gerrit.wikimedia.org/r/1227799 (https://phabricator.wikimedia.org/T410296) [14:09:43] (03PS2) 10Dpogorzelski: ml_builder: add missing prod_build_password [labs/private] - 10https://gerrit.wikimedia.org/r/1227800 [14:10:46] (03PS5) 10Dpogorzelski: ml-build: add missing configs [puppet] - 10https://gerrit.wikimedia.org/r/1227743 [14:11:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2245', diff saved to https://phabricator.wikimedia.org/P87626 and previous config saved to /var/cache/conftool/dbconfig/20260116-141105-marostegui.json [14:12:50] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805 (10MatthewVernon) 03NEW [14:14:25] RESOLVED: [14x] SystemdUnitFailed: prometheus-node-textfile-check-nft.service on tcp-proxy1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:19:07] (03CR) 10Elukey: ml_builder: add missing prod_build_password (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/1227800 (owner: 10Dpogorzelski) [14:19:09] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: Propose a new set of standard thumbnail sizes - https://phabricator.wikimedia.org/T412971#11529006 (10MatthewVernon) 05Open→03Resolved I think we're settled on this set of sizes. 
[14:21:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2245', diff saved to https://phabricator.wikimedia.org/P87627 and previous config saved to /var/cache/conftool/dbconfig/20260116-142114-marostegui.json [14:22:54] 06SRE, 06Release-Engineering-Team, 06ServiceOps new, 10ServiceOps-SharedInfra: docker-registry "Last updated at" time should specify TZ - https://phabricator.wikimedia.org/T404010#11529013 (10Clement_Goubert) p:05Triage→03Low [14:22:58] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: FY 25/26 WE 5.4.7 Standardize thumbnail sizes - https://phabricator.wikimedia.org/T408062#11529015 (10MatthewVernon) [14:23:07] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: FY 25/26 WE 5.4.7 Standardize thumbnail sizes - https://phabricator.wikimedia.org/T408062#11529019 (10MatthewVernon) 05Open→03Resolved [14:23:23] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Release-Engineering-Team, and 2 others: DannyS712 "offboarding" - https://phabricator.wikimedia.org/T413634#11529021 (10FCeratto-WMF) Hello @DannyS712 sorry for the ping, when you have a second could you please reply to https://phabricator.wikimedia.o... 
[14:23:57] 06SRE, 06Release-Engineering-Team, 06ServiceOps new, 10ServiceOps-SharedInfra: docker-registry "Last updated at" time should specify TZ - https://phabricator.wikimedia.org/T404010#11529022 (10Clement_Goubert) [14:26:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:27:45] 06SRE, 06Release-Engineering-Team, 06ServiceOps new, 10ServiceOps-SharedInfra: docker-registry "Last updated at" text hiding under scrollbar - https://phabricator.wikimedia.org/T404008#11529038 (10Clement_Goubert) p:05Triage→03Low [14:28:27] !log asw1-b12-drmrs> restart statistics-service - T413181 [14:28:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:32] T413181: asw1-b12-drmrs stopped reporting metrics - https://phabricator.wikimedia.org/T413181 [14:30:46] 06SRE, 10SRE-Access-Requests: Requesting access to cassandra-staging-devs for dr0ptp4kt - https://phabricator.wikimedia.org/T412875#11529045 (10KOfori) Approved. 
[14:31:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2245 (T413525)', diff saved to https://phabricator.wikimedia.org/P87628 and previous config saved to /var/cache/conftool/dbconfig/20260116-143122-marostegui.json [14:31:27] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [14:31:39] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2246.codfw.wmnet with reason: Maintenance [14:31:48] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2246 (T413525)', diff saved to https://phabricator.wikimedia.org/P87629 and previous config saved to /var/cache/conftool/dbconfig/20260116-143147-marostegui.json [14:32:55] 06SRE, 06Release-Engineering-Team, 06ServiceOps new, 10ServiceOps-SharedInfra, 07good first task: docker-registry "Last updated at" time should specify TZ - https://phabricator.wikimedia.org/T404010#11529068 (10Clement_Goubert) Thank you for tagging this task with #good_first_task for Wikimedia newcomers... [14:33:13] 06SRE, 06Release-Engineering-Team, 06ServiceOps new, 10ServiceOps-SharedInfra, 07good first task: docker-registry "Last updated at" text hiding under scrollbar - https://phabricator.wikimedia.org/T404008#11529071 (10Clement_Goubert) Thank you for tagging this task with #good_first_task for Wikimedia newc... [14:33:56] (03CR) 10Elukey: "Once removed the extra settings feel free to merge!" [labs/private] - 10https://gerrit.wikimedia.org/r/1227800 (owner: 10Dpogorzelski) [14:34:09] (03CR) 10MVernon: [C:03+1] "I double-checked the VLANs in netbox; this seems sensible to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/1218784 (https://phabricator.wikimedia.org/T411617) (owner: 10Cathal Mooney) [14:35:07] (03PS1) 10Federico Ceratto: admin: Add johannesrichterwmde to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1227809 (https://phabricator.wikimedia.org/T414678) [14:35:07] (03CR) 10Federico Ceratto: "Could you please take a look when you have a sec? Thanks" [puppet] - 10https://gerrit.wikimedia.org/r/1227809 (https://phabricator.wikimedia.org/T414678) (owner: 10Federico Ceratto) [14:38:06] FIRING: CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth1 (Subnet frack-fundraising-codfw in F5) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [14:39:01] (03CR) 10MVernon: [C:03+1] "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/1227759 (https://phabricator.wikimedia.org/T394476) (owner: 10Elukey) [14:39:55] 06SRE, 06Release-Engineering-Team, 06ServiceOps new, 10ServiceOps-SharedInfra, 07good first task: docker-registry "Last updated at" text hiding under scrollbar - https://phabricator.wikimedia.org/T404008#11529087 (10Clement_Goubert) [14:40:22] moritzm: cleaned up. laters:) [14:41:04] 06SRE, 06Release-Engineering-Team, 06ServiceOps new, 10ServiceOps-SharedInfra, 07good first task: docker-registry "Last updated at" text hiding under scrollbar - https://phabricator.wikimedia.org/T404008#11529091 (10Clement_Goubert) [14:42:07] 06SRE, 10DNS, 06serviceops, 06Traffic, and 3 others: Set up DNS for abstract.wikipedia.org to be recognised - https://phabricator.wikimedia.org/T411724#11529094 (10ssingh) Yes, I should have clarified better, sorry. There is nothing special about `langlist.tmpl` as such. It just lists the language editions... 
[14:42:58] (03PS1) 10Dzahn: miscweb: update wikipedia25 image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1227811 (https://phabricator.wikimedia.org/T408592) [14:43:06] (03CR) 10CI reject: [V:04-1] miscweb: update wikipedia25 image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1227811 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [14:43:20] (03PS2) 10Dzahn: miscweb: update wikipedia25 image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1227811 (https://phabricator.wikimedia.org/T408592) [14:43:41] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'clear' for AS: 4800 [14:44:32] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'clear' for AS: 4800 [14:44:43] (03CR) 10Dzahn: [C:03+2] miscweb: update wikipedia25 image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1227811 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [14:45:13] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026.01.05 - 2026.01.23), 07Essential-Work: Requesting deployment access for AKhatun - https://phabricator.wikimedia.org/T414347#11529106 (10AKhatun_WMF) Thanks! Do I need a different set of permissions to access `an-launcher`? I am getting a permissio... [14:46:36] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026.01.05 - 2026.01.23), 07Essential-Work: Requesting deployment access for AKhatun - https://phabricator.wikimedia.org/T414347#11529110 (10Dzahn) Yes, an-launcher is unrelated to deployment. Data-Platform should be able to help with that one. 
[14:46:42] (03Merged) 10jenkins-bot: miscweb: update wikipedia25 image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1227811 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [14:46:46] 06SRE, 10SRE-Access-Requests: Requesting access to cassandra-staging-devs for dr0ptp4kt - https://phabricator.wikimedia.org/T412875#11529111 (10FCeratto-WMF) [14:47:29] 06SRE, 10DNS, 06serviceops, 06Traffic, and 3 others: Set up DNS for abstract.wikipedia.org to be recognised - https://phabricator.wikimedia.org/T411724#11529117 (10ssingh) ` {% from "helpers/langlist.tmpl" import langs %} {% for lang in langs -%} {{ lang }} 1D IN CNAME dyna.wikimedia.org. {{ lang }}.m 1D... [14:47:37] 06SRE, 06Traffic, 13Patch-For-Review: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11529119 (10cmooney) >>! In T81605#11522551, @ssingh wrote: > @cmooney: Any picks for your favourite v6 address for `ns1`? I was thinking of allocating `2620:0:860:ed1a::4/128` under LVS service I... 
[14:48:29] !log dzahn@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [14:48:51] !log dzahn@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [14:49:29] !log dzahn@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [14:49:47] !log dzahn@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [14:50:16] !log dzahn@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [14:50:38] !log dzahn@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [14:50:55] (03PS2) 10Jcrespo: backup: Setup ms-backup[12]00[34] [puppet] - 10https://gerrit.wikimedia.org/r/1227789 (https://phabricator.wikimedia.org/T414717) [14:51:12] (03PS2) 10Jcrespo: backup: Set up backup1015-backup1020 & backup2015-backup2020 [puppet] - 10https://gerrit.wikimedia.org/r/1227792 (https://phabricator.wikimedia.org/T414728) [14:52:45] (03PS2) 10Federico Ceratto: admin: Add johannesrichterwmde to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1227809 (https://phabricator.wikimedia.org/T414678) [14:54:22] 06SRE, 06Traffic, 13Patch-For-Review: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11529143 (10cmooney) >>! In T81605#11518553, @ssingh wrote: > Our glue records also have a disparity. I was interested to know what effect this would have. One data-point for Bind (at least my l... [14:58:38] (03PS3) 10Elukey: docker_registry: allor to set the loglevel for an instance [puppet] - 10https://gerrit.wikimedia.org/r/1227705 (https://phabricator.wikimedia.org/T394476) [14:58:57] (03CR) 10Elukey: docker_registry: allor to set the loglevel for an instance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1227705 (https://phabricator.wikimedia.org/T394476) (owner: 10Elukey) [15:01:37] (03CR) 10Elukey: "Hi Matthew! At the moment I am very confused by a HTTP PUT that starts from the registry, ending up in a HTTP 504 from envoy. 
On the Ceph " [puppet] - 10https://gerrit.wikimedia.org/r/1227759 (https://phabricator.wikimedia.org/T394476) (owner: 10Elukey) [15:07:51] (03PS1) 10Federico Ceratto: admin: Add dr0ptp4kt to cassandra-staging-devs [puppet] - 10https://gerrit.wikimedia.org/r/1227816 (https://phabricator.wikimedia.org/T412875) [15:09:09] (03CR) 10Elukey: "Details in https://phabricator.wikimedia.org/T394476#11528530" [puppet] - 10https://gerrit.wikimedia.org/r/1227759 (https://phabricator.wikimedia.org/T394476) (owner: 10Elukey) [15:09:11] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:10:15] (03CR) 10Ssingh: add abstract.wikipedia.org to section for wikis not covered by langlist (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1227706 (https://phabricator.wikimedia.org/T411724) (owner: 10Dzahn) [15:10:45] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1227818 [15:21:04] 06SRE, 06Release-Engineering-Team, 06ServiceOps new, 10ServiceOps-SharedInfra: docker-registry "Last updated at" time should specify TZ - https://phabricator.wikimedia.org/T404010#11529266 (10Clement_Goubert) Removing #good_first_task as it is a good first task for a ServiceOps staff member, not outside co... [15:21:17] 06SRE, 06Release-Engineering-Team, 06ServiceOps new, 10ServiceOps-SharedInfra: docker-registry "Last updated at" text hiding under scrollbar - https://phabricator.wikimedia.org/T404008#11529272 (10Clement_Goubert) Removing #good_first_task as it is a good first task for a ServiceOps staff member, not outsi... 
[15:23:51] (03PS3) 10Dpogorzelski: ml_builder: add missing prod_build_password [labs/private] - 10https://gerrit.wikimedia.org/r/1227800 [15:24:15] (03CR) 10Dpogorzelski: [C:03+2] ml_builder: add missing prod_build_password (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/1227800 (owner: 10Dpogorzelski) [15:24:17] (03CR) 10Dpogorzelski: [V:03+2 C:03+2] ml_builder: add missing prod_build_password [labs/private] - 10https://gerrit.wikimedia.org/r/1227800 (owner: 10Dpogorzelski) [15:27:24] (03CR) 10Dpogorzelski: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1227743 (owner: 10Dpogorzelski) [15:28:03] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/1227809 (https://phabricator.wikimedia.org/T414678) (owner: 10Federico Ceratto) [15:28:38] (03PS1) 10Clément Goubert: charts: Remove unused chart mediawiki-dev [deployment-charts] - 10https://gerrit.wikimedia.org/r/1227821 (https://phabricator.wikimedia.org/T401197) [15:28:54] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1227816 (https://phabricator.wikimedia.org/T412875) (owner: 10Federico Ceratto) [15:32:17] 07sre-alert-triage, 10Prod-Kubernetes, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Alert in need of triage: KubernetesWorkerUnschedulable - https://phabricator.wikimedia.org/T400969#11529307 (10Clement_Goubert) 05Stalled→03In progress a:03jasmine_ @jasmine_ please resolve this task when done wi... 
[15:34:11] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:36:03] (03CR) 10Clément Goubert: wikikube: decommission worker[2052-2054,2063,2079-2084,2096-2101].codfw.wmnet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1227431 (https://phabricator.wikimedia.org/T409103) (owner: 10Jasmine) [15:36:54] (03CR) 10Federico Ceratto: [C:03+2] admin: Add dr0ptp4kt to cassandra-staging-devs [puppet] - 10https://gerrit.wikimedia.org/r/1227816 (https://phabricator.wikimedia.org/T412875) (owner: 10Federico Ceratto) [15:37:02] (03CR) 10Federico Ceratto: [C:03+2] admin: Add johannesrichterwmde to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1227809 (https://phabricator.wikimedia.org/T414678) (owner: 10Federico Ceratto) [15:41:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:43:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs1026:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [15:46:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:51:56] (03PS1) 10Clément Goubert: 
deployment_server: Fix group name typo [puppet] - 10https://gerrit.wikimedia.org/r/1227829 (https://phabricator.wikimedia.org/T402512) [15:52:32] (03CR) 10Dzahn: [C:03+1] deployment_server: Fix group name typo [puppet] - 10https://gerrit.wikimedia.org/r/1227829 (https://phabricator.wikimedia.org/T402512) (owner: 10Clément Goubert) [15:52:41] (03CR) 10Clément Goubert: [C:03+2] deployment_server: Fix group name typo [puppet] - 10https://gerrit.wikimedia.org/r/1227829 (https://phabricator.wikimedia.org/T402512) (owner: 10Clément Goubert) [15:52:45] 06SRE, 06Traffic, 13Patch-For-Review: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11529426 (10ssingh) >>! In T81605#11529119, @cmooney wrote: >>>! In T81605#11522551, @ssingh wrote: >> @cmooney: Any picks for your favourite v6 address for `ns1`? I was thinking of allocating `26... [15:53:21] (03CR) 10Elukey: "I mixed logs in my reports, so far it seems that a POST/PUT (depending on the implementation of the registry) reaches envoy and nothing is" [puppet] - 10https://gerrit.wikimedia.org/r/1227759 (https://phabricator.wikimedia.org/T394476) (owner: 10Elukey) [15:58:16] (03CR) 10MVernon: [C:03+1] "well, happy for you to use this for testing/debugging, but we might want to revert once all the bugs are gone (fx: hollow laughter) :)" [puppet] - 10https://gerrit.wikimedia.org/r/1227759 (https://phabricator.wikimedia.org/T394476) (owner: 10Elukey) [15:58:29] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2356.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:03:53] 10ops-eqiad, 06SRE, 06DC-Ops: Upgrade any 1 Gbps dse-k8s-worker-eqiad network interfaces to 10 Gbps - https://phabricator.wikimedia.org/T414787#11529476 (10BTullis) Great! Thanks, both. So we potentially have 3 cards that might be suitable, from the `logstash103[3–5]` servers. These are all `R440` hosts. I... 
[16:04:01] 06SRE, 06Release-Engineering-Team, 10Scap, 06ServiceOps new, and 2 others: Add scap lock/unlock steps to sre.switchdc.mediawiki cookbook - https://phabricator.wikimedia.org/T330996#11529477 (10MLechvien-WMF) [16:04:09] 06SRE, 06Release-Engineering-Team, 06ServiceOps new, 10ServiceOps-good-first-task, 10ServiceOps-SharedInfra: docker-registry "Last updated at" time should specify TZ - https://phabricator.wikimedia.org/T404010#11529478 (10Clement_Goubert) [16:04:27] 06SRE, 06Release-Engineering-Team, 06ServiceOps new, 10ServiceOps-good-first-task, 10ServiceOps-SharedInfra: docker-registry "Last updated at" text hiding under scrollbar - https://phabricator.wikimedia.org/T404008#11529480 (10Clement_Goubert) [16:04:41] (03CR) 10JHathaway: [C:03+1] Remove remaining traces of profile::puppet::agent::force_puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/1227616 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [16:05:18] (03CR) 10JHathaway: [C:03+1] Copy yamllint into the puppetserver module and use it [puppet] - 10https://gerrit.wikimedia.org/r/1227702 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [16:05:31] (03CR) 10JHathaway: [C:03+1] Remove puppetmaster spec files [puppet] - 10https://gerrit.wikimedia.org/r/1227698 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [16:06:43] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2356.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:09:22] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11529534 (10Papaul) [16:09:44] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2355.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:11:20] 06SRE, 06Traffic, 13Patch-For-Review: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11529540 (10cmooney) >>! 
In T81605#11529426, @ssingh wrote: > Thanks! The plan is to do the same for `eqiad` Ok I've reserved those two ranges/IPs in Netbox now. > Any thoughts on the last one (an... [16:11:25] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Upgrade any 1 Gbps dse-k8s-worker-eqiad network interfaces to 10 Gbps - https://phabricator.wikimedia.org/T414787#11529541 (10BTullis) [16:17:05] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2355.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:17:23] (03Abandoned) 10Btullis: Bind the spark-job-orchestration role to the default serviceaccount [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212130 (https://phabricator.wikimedia.org/T410017) (owner: 10Btullis) [16:20:06] 06SRE, 06Traffic, 13Patch-For-Review: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11529590 (10ssingh) >>! In T81605#11529540, @cmooney wrote: >>>! In T81605#11529426, @ssingh wrote: >> Thanks! The plan is to do the same for `eqiad` > > Ok I've reserved those two ranges/IPs in Ne... [16:20:16] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11529592 (10Papaul) @Clement_Goubert hello can you or someone on your team please add these servers to site.pp with the insetup role? 
Thanks [16:20:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:21:15] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2354.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:21:28] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11529595 (10Papaul) [16:24:09] (03CR) 10Elukey: [C:03+1] ml-build: add missing configs [puppet] - 10https://gerrit.wikimedia.org/r/1227743 (owner: 10Dpogorzelski) [16:24:57] (03CR) 10Elukey: [V:03+2 C:03+2] ml: fix vllm's image builder config [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1227697 (https://phabricator.wikimedia.org/T385173) (owner: 10Elukey) [16:25:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:25:58] (03PS1) 10Clément Goubert: site.pp: Add wikikube-worker23[32-56] [puppet] - 10https://gerrit.wikimedia.org/r/1227843 (https://phabricator.wikimedia.org/T408757) [16:26:29] (03CR) 10CI reject: [V:04-1] site.pp: Add wikikube-worker23[32-56] [puppet] - 10https://gerrit.wikimedia.org/r/1227843 (https://phabricator.wikimedia.org/T408757) (owner: 10Clément Goubert) [16:26:36] 10ops-codfw, 06SRE, 06DC-Ops, 06SRE Observability (FY2025/2026-Q3): Q2:rack/setup/install mwlog2003 - https://phabricator.wikimedia.org/T412229#11529619 (10Jhancock.wm) a:03Jhancock.wm [16:26:46] 10ops-codfw, 06SRE, 
06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11529622 (10Clement_Goubert) >>! In T408757#11529592, @Papaul wrote: > @Clement_Goubert hello can you or someone on your team please add these servers t... [16:26:56] 06SRE, 06Data-Engineering, 06Data-Platform-SRE, 06ServiceOps new, and 2 others: DRY kafka broker declaration in helmfiles - https://phabricator.wikimedia.org/T253058#11529623 (10MLechvien-WMF) [16:27:16] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, and 2 others: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11529625 (10Clement_Goubert) [16:28:02] 06SRE, 06Data-Engineering, 06Data-Platform-SRE, 06ServiceOps new, and 3 others: DRY kafka broker declaration in helmfiles - https://phabricator.wikimedia.org/T253058#11529628 (10MLechvien-WMF) [16:28:24] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2354.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:28:45] (03Abandoned) 10Clément Goubert: site.pp: Add wikikube-worker23[32-56] [puppet] - 10https://gerrit.wikimedia.org/r/1227843 (https://phabricator.wikimedia.org/T408757) (owner: 10Clément Goubert) [16:33:57] 06SRE, 06Data-Engineering, 06Data-Platform-SRE, 06ServiceOps new, and 3 others: DRY kafka broker declaration in helmfiles - https://phabricator.wikimedia.org/T253058#11529661 (10MLechvien-WMF) a:03brouberol @brouberol bringing back this task as we're going through Serviceops backlog. Reading the last u... 
[16:38:08] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11529679 (10Jhancock.wm) [16:38:58] (03PS1) 10CDanis: tunnelencabulator: simple IPv6 support [debs/wmf-laptop] - 10https://gerrit.wikimedia.org/r/1227846 (https://phabricator.wikimedia.org/T411895) [16:39:50] 06SRE, 10observability, 06serviceops: write some recording rules for queries used in the appserver RED dashboard - https://phabricator.wikimedia.org/T249663#11529688 (10MLechvien-WMF) 05Open→03Invalid This appservers RED dashboard got deprecated in favor of https://grafana.wikimedia.org/d/35WSHOjVk/a... [16:40:05] (03PS2) 10CDanis: tunnelencabulator: simple IPv6 support [debs/wmf-laptop] - 10https://gerrit.wikimedia.org/r/1227846 (https://phabricator.wikimedia.org/T414719) [16:40:45] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2353.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:41:41] 06SRE, 10observability, 06serviceops: write some recording rules for queries used in the appserver RED dashboard - https://phabricator.wikimedia.org/T249663#11529696 (10CDanis) 05Invalid→03Open p:05Medium→03High The appservers RED k8s dashboard makes even heavier queries, and was the trigger of a Tha... 
[16:42:09] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11529705 (10Papaul) [16:46:28] 06SRE, 10observability, 10Prod-Kubernetes, 06ServiceOps new: write some recording rules for queries used in the appserver RED k8s dashboard - https://phabricator.wikimedia.org/T249663#11529714 (10MLechvien-WMF) [16:46:54] (03PS1) 10Federico Ceratto: admin: remove ryankemper's old SSH key [puppet] - 10https://gerrit.wikimedia.org/r/1227848 (https://phabricator.wikimedia.org/T412126) [16:46:59] (03PS5) 10Ssingh: dnsbox: codfw: advertise ns1 IPv6 (2620:0:860:53::/128) [puppet] - 10https://gerrit.wikimedia.org/r/1226928 (https://phabricator.wikimedia.org/T81605) [16:47:39] 06SRE, 10observability, 10Prod-Kubernetes, 06ServiceOps new: write some recording rules for queries used in the appserver RED k8s dashboard - https://phabricator.wikimedia.org/T249663#11529722 (10MLechvien-WMF) Thanks and sorry for the wrong triaging. Renaming this task to include the k8s, and moving it... [16:47:40] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Yubikey-SSH-FIDO for ryankemper - https://phabricator.wikimedia.org/T412126#11529721 (10FCeratto-WMF) Ryan confirmed on IRC, opening https://gerrit.wikimedia.org/r/c/operations/puppet/+/1227848 [16:47:49] (03CR) 10Ssingh: "@bblack@wikimedia.org, @cmooney@wikimedia.org: I plan to merge this on Monday but could use a review here. 
Thanks <3" [dns] - 10https://gerrit.wikimedia.org/r/1226904 (https://phabricator.wikimedia.org/T81605) (owner: 10Ssingh) [16:48:05] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2353.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:48:06] (03CR) 10Ssingh: [V:03+1 C:04-2] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7906/co" [puppet] - 10https://gerrit.wikimedia.org/r/1226928 (https://phabricator.wikimedia.org/T81605) (owner: 10Ssingh) [16:48:08] (03PS1) 10Jdlrobson: Add namespace-specific collapsible section handlng for Parsoid mobile [extensions/MobileFrontend] (wmf/1.46.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1227849 (https://phabricator.wikimedia.org/T407815) [16:48:32] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker2332.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:49:02] 06SRE, 10SRE-Access-Requests: Requesting access to cassandra-staging-devs for dr0ptp4kt - https://phabricator.wikimedia.org/T412875#11529736 (10FCeratto-WMF) @dr0ptp4kt the change was deployed, can you please confirm if the access works for you now? [16:49:04] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker2333.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:49:18] (03CR) 10Ssingh: "1) ensure that v6 /128 is allocated (done by Cathal). 
2) merge this patch, start advertising v6 and ensure that bird/routers agree (since " [puppet] - 10https://gerrit.wikimedia.org/r/1226928 (https://phabricator.wikimedia.org/T81605) (owner: 10Ssingh) [16:49:29] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker2334.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:50:01] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2352.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:50:21] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker2335.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:52:39] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11529747 (10Jhancock.wm) [16:52:45] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker2332.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:53:04] pt1979@cumin2002 provision (PID 4123699) is awaiting input [16:53:24] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker2333.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:53:30] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11529748 (10Jhancock.wm) [16:53:46] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker2336.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:54:20] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker2335.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:54:38] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker2332.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:54:49] 
!log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker2332.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:54:52] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for johannesrichterwmde - https://phabricator.wikimedia.org/T414678#11529749 (10FCeratto-WMF) [16:55:17] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for johannesrichterwmde - https://phabricator.wikimedia.org/T414678#11529752 (10FCeratto-WMF) @Johannes_Richter_WMDE i updated the permissions, can you please confirm that you have access now? Thanks [16:56:59] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2334.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:57:42] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker2336.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:57:53] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11529754 (10Papaul) @Clement_Goubert thanks [17:00:43] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2352.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:04:59] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker2332.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:05:01] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2351.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:05:20] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker2332.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:05:21] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker2333.mgmt.codfw.wmnet with 
chassis set policy FORCE_RESTART [17:05:37] (03CR) 10Bartosz Dziewoński: [C:03+1] debug: Add X-Provenance header to Logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226903 (https://phabricator.wikimedia.org/T412396) (owner: 10Gergő Tisza) [17:05:40] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11529817 (10Papaul) [17:05:41] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker2333.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:05:46] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker2332.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:06:05] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker2332.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:06:37] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker2332.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:07:01] RECOVERY - Backup freshness on backup1014 is OK: Fresh: 140 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [17:07:05] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker2332.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:08:20] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker2332.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:08:40] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker2332.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:10:02] jhancock@cumin1003 provision (PID 1934421) is awaiting input [17:10:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T413525)', diff saved to https://phabricator.wikimedia.org/P87632 and 
previous config saved to /var/cache/conftool/dbconfig/20260116-171016-marostegui.json [17:10:22] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [17:14:05] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2351.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:15:07] 06SRE: please update astein puppet ssh key - https://phabricator.wikimedia.org/T414830 (10AStein-WMF) 03NEW [17:16:05] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker2332.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:16:42] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker2332.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:17:55] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2350.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:18:58] (03CR) 10BryanDavis: [C:03+1] "When using this new version I am able to do `ssh -6 gerrit -- gerrit show-connections --wide | grep bd808` and see my traffic passing thro" [debs/wmf-laptop] - 10https://gerrit.wikimedia.org/r/1227846 (https://phabricator.wikimedia.org/T414719) (owner: 10CDanis) [17:19:40] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker2332.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:20:12] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker2332.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:20:22] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11529908 (10Papaul) [17:20:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P87633 and previous config saved 
to /var/cache/conftool/dbconfig/20260116-172025-marostegui.json [17:21:52] (03CR) 10CDanis: [V:03+2 C:03+2] "`./tunnelencabulator --self-test --verbose` reports 17 (all) tests passed" [debs/wmf-laptop] - 10https://gerrit.wikimedia.org/r/1227846 (https://phabricator.wikimedia.org/T414719) (owner: 10CDanis) [17:22:08] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for johannesrichterwmde - https://phabricator.wikimedia.org/T414678#11529911 (10Johannes_Richter_WMDE) >>! In T414678#11529749, @FCeratto-WMF wrote: > @Johannes_Richter_WMDE i updated the permissions, can you please confirm that you... [17:24:11] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [17:25:07] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2350.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:25:55] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.3 point update - https://phabricator.wikimedia.org/T414179#11529932 (10MoritzMuehlenhoff) [17:27:21] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker2332.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:27:49] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker2332.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:28:05] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker2332.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:28:31] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker2332.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:30:34] !log 
marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P87634 and previous config saved to /var/cache/conftool/dbconfig/20260116-173033-marostegui.json [17:36:02] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2349.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:36:54] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11529968 (10Papaul) [17:40:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T413525)', diff saved to https://phabricator.wikimedia.org/P87635 and previous config saved to /var/cache/conftool/dbconfig/20260116-174042-marostegui.json [17:40:48] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [17:40:59] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1244.eqiad.wmnet with reason: Maintenance [17:41:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1244 (T413525)', diff saved to https://phabricator.wikimedia.org/P87636 and previous config saved to /var/cache/conftool/dbconfig/20260116-174107-marostegui.json [17:43:25] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2349.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:46:02] (03CR) 10Michael Große: [C:03+1] "I think this is now ready to be deployed in the new week, right?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1219541 (https://phabricator.wikimedia.org/T411479) (owner: 10Sergio Gimeno) [17:46:28] 07sre-alert-triage, 06Data-Platform-SRE (2026.01.05 - 2026.01.23), 07Essential-Work: Alert in need of triage: Dell PowerEdge or Supermicro Broadcom RAID Controller (instance an-worker1187) - https://phabricator.wikimedia.org/T405217#11529988 (10BTullis) We currently have four hosts for which this alert is fi... [17:47:01] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2348.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:48:31] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11529993 (10Papaul) [17:52:50] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1187.eqiad.wmnet [17:53:17] 07sre-alert-triage, 06Data-Platform-SRE (2026.01.05 - 2026.01.23), 07Essential-Work: Alert in need of triage: Dell PowerEdge or Supermicro Broadcom RAID Controller (instance an-worker1187) - https://phabricator.wikimedia.org/T405217#11530000 (10ops-monitoring-bot) Host an-worker1187.eqiad.wmnet rebooted by b... 
[17:54:11] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2348.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:55:36] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11530005 (10Papaul) [17:56:03] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2347.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:56:23] (03CR) 10Ryan Kemper: [C:03+1] admin: remove ryankemper's old SSH key [puppet] - 10https://gerrit.wikimedia.org/r/1227848 (https://phabricator.wikimedia.org/T412126) (owner: 10Federico Ceratto) [17:59:06] pt1979@cumin2002 provision (PID 4156451) is awaiting input [17:59:25] (03PS1) 10Ryan Kemper: wdqs: provide trueg root access [puppet] - 10https://gerrit.wikimedia.org/r/1227862 (https://phabricator.wikimedia.org/T414517) [17:59:36] 06SRE, 06Traffic, 13Patch-For-Review: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11530021 (10ayounsi) overall lgtm Using a full /64 unicast `2620:0:860:53::/64` for a single service looks a bit weird, but as it's something critical like AuthDNS it doesn't shock me too much. T... [18:07:19] (03CR) 10BCornwall: [C:03+1] wmnet: Update s4-master alias [dns] - 10https://gerrit.wikimedia.org/r/1226512 (https://phabricator.wikimedia.org/T414543) (owner: 10Gerrit maintenance bot) [18:11:24] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2347.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:12:00] 06SRE, 06Traffic, 13Patch-For-Review: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11530059 (10ssingh) >>! In T81605#11530021, @ayounsi wrote: > overall lgtm > > Using a full /64 unicast `2620:0:860:53::/64` for a single service looks a bit weird, but as it's something critical... 
[18:15:54] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host an-worker1187.eqiad.wmnet [18:21:54] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11530080 (10Papaul) [18:31:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2246 (T413525)', diff saved to https://phabricator.wikimedia.org/P87637 and previous config saved to /var/cache/conftool/dbconfig/20260116-183155-marostegui.json [18:32:00] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [18:34:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87638 and previous config saved to /var/cache/conftool/dbconfig/20260116-183445-marostegui.json [18:34:52] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [18:34:53] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [18:38:06] FIRING: CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth1 (Subnet frack-fundraising-codfw in F5) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [18:42:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2246', diff saved to https://phabricator.wikimedia.org/P87639 and previous config saved to /var/cache/conftool/dbconfig/20260116-184203-marostegui.json [18:42:55] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker2332.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART 
[18:44:10] (03CR) 10Ssingh: "Looks good and thanks for the patch. One question on my mind is that if we should also update geo-maps in the process. I know this is stri" [puppet] - 10https://gerrit.wikimedia.org/r/1218784 (https://phabricator.wikimedia.org/T411617) (owner: 10Cathal Mooney) [18:44:21] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker2333.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:44:50] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker2333.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:44:54] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P87640 and previous config saved to /var/cache/conftool/dbconfig/20260116-184454-marostegui.json [18:45:20] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker2333.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:45:36] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker2335.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:46:28] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker2336.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:49:54] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2332.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:52:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2246', diff saved to https://phabricator.wikimedia.org/P87641 and previous config saved to /var/cache/conftool/dbconfig/20260116-185212-marostegui.json [18:52:52] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2335.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:53:20] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 
10Thumbor, 06Traffic: FY 25/26 WE 5.4.7 Standardize thumbnail sizes - https://phabricator.wikimedia.org/T408062#11530186 (10simon04) I wonder whether any documentation need to be updated, for instance... - https://www.mediawiki.org/wiki/Help:Im... [18:53:47] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2336.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:54:48] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2333.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:55:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P87642 and previous config saved to /var/cache/conftool/dbconfig/20260116-185502-marostegui.json [18:55:42] (03CR) 10Cathal Mooney: "I think this is better discussed on task so any decision is easier to find in future. Your point totally makes sense, I guess I only look" [puppet] - 10https://gerrit.wikimedia.org/r/1218784 (https://phabricator.wikimedia.org/T411617) (owner: 10Cathal Mooney) [18:56:22] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker2337.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:56:45] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker2338.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:57:16] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker2339.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:57:45] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker2340.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [18:58:21] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker2341.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:01:14] (03CR) 10Ssingh: "Thanks, adding it there so we can get more input." 
[puppet] - 10https://gerrit.wikimedia.org/r/1218784 (https://phabricator.wikimedia.org/T411617) (owner: 10Cathal Mooney) [19:02:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2246 (T413525)', diff saved to https://phabricator.wikimedia.org/P87643 and previous config saved to /var/cache/conftool/dbconfig/20260116-190220-marostegui.json [19:02:26] 06SRE, 06Traffic, 13Patch-For-Review: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11530209 (10BBlack) >>! In T81605#11529143, @cmooney wrote: >>>! In T81605#11518553, @ssingh wrote: >> Our glue records also have a disparity. > > I was interested to know what effect this would... [19:02:26] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [19:02:27] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2247.codfw.wmnet with reason: Maintenance [19:02:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2247 (T413525)', diff saved to https://phabricator.wikimedia.org/P87644 and previous config saved to /var/cache/conftool/dbconfig/20260116-190235-marostegui.json [19:03:43] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2337.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:03:52] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23), 07Essential-Work: Degraded RAID on an-worker1200 - https://phabricator.wikimedia.org/T413360#11530223 (10VRiley-WMF) @BTullis No, the drive wasn't replaced. I was waiting until a full go-ahead form you. I saw the reboot, but was... 
[19:04:06] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2338.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:04:39] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker2342.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:04:50] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2339.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:04:56] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2340.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:05:08] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker2343.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:05:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87645 and previous config saved to /var/cache/conftool/dbconfig/20260116-190510-marostegui.json [19:05:25] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [19:05:26] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker2344.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:05:26] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [19:05:31] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1184.eqiad.wmnet with reason: Maintenance [19:05:37] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2341.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:05:40] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1184 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87646 and previous config saved 
to /var/cache/conftool/dbconfig/20260116-190539-marostegui.json [19:06:00] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker2345.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:08:00] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11530263 (10Jhancock.wm) [19:11:50] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2342.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:12:23] jhancock@cumin1003 provision (PID 1950397) is awaiting input [19:12:32] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2345 [19:12:33] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2343.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:12:40] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2344.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:12:46] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2345 [19:13:46] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11530280 (10Jhancock.wm) [19:14:07] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker2345.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:14:09] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11530281 (10Jhancock.wm) wikikube-worker2345 is not provisioning. will check next time on site. 
[19:17:34] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2332.codfw.wmnet with OS bookworm [19:17:43] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11530283 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host wikikube-worker2332.codf... [19:18:17] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2333.codfw.wmnet with OS bookworm [19:18:32] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11530284 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host wikikube-worker2333.codf... [19:18:38] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2334.codfw.wmnet with OS bookworm [19:18:46] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11530285 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host wikikube-worker2334.codf... 
[19:29:08] !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2332.codfw.wmnet with reason: host reimage [19:29:29] !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2333.codfw.wmnet with reason: host reimage [19:29:41] !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2334.codfw.wmnet with reason: host reimage [19:34:11] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:35:04] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2332.codfw.wmnet with reason: host reimage [19:38:37] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2333.codfw.wmnet with reason: host reimage [19:42:12] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2334.codfw.wmnet with reason: host reimage [19:44:03] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs1026:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [19:52:36] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [19:53:14] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [19:53:15] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
wikikube-worker2332.codfw.wmnet with OS bookworm [19:53:23] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11530332 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host wikikube-worker2332.codfw.wm... [19:55:32] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [19:55:47] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [19:55:48] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2333.codfw.wmnet with OS bookworm [19:55:58] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11530345 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host wikikube-worker2333.codfw.wm... 
[19:56:11] (03CR) 10Bking: [C:03+1] wdqs: provide trueg root access [puppet] - 10https://gerrit.wikimedia.org/r/1227862 (https://phabricator.wikimedia.org/T414517) (owner: 10Ryan Kemper) [20:00:50] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [20:01:08] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [20:01:09] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2334.codfw.wmnet with OS bookworm [20:01:23] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11530350 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host wikikube-worker2334.codfw.wm... [20:02:02] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11530351 (10Jhancock.wm) [20:02:46] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2335.codfw.wmnet with OS bookworm [20:02:55] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11530354 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host wikikube-worker2335.codf... 
[20:03:02] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2336.codfw.wmnet with OS bookworm [20:03:13] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11530355 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host wikikube-worker2336.codf... [20:03:15] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2337.codfw.wmnet with OS bookworm [20:03:24] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11530358 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host wikikube-worker2337.codf... [20:14:04] !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2336.codfw.wmnet with reason: host reimage [20:14:20] !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2335.codfw.wmnet with reason: host reimage [20:14:45] !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2337.codfw.wmnet with reason: host reimage [20:15:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-eqord and cr3-ulsfo (198.35.26.128) - group Confed_ulsfo - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [20:17:47] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Upgrade any 1 Gbps dse-k8s-worker-eqiad network interfaces to 10 Gbps - https://phabricator.wikimedia.org/T414787#11530391 (10RobH) So there are a few considerations here that I can see: * I don't want to put our actual budget fi... 
[20:17:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [20:18:38] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2336.codfw.wmnet with reason: host reimage [20:20:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [20:21:47] 06SRE, 06Traffic, 13Patch-For-Review: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11530397 (10cmooney) >>! In T81605#11530209, @BBlack wrote: > Except almost nobody but engineers are going to directly query that record. Most caches will learn and re-learn it as they traverse t... 
[20:22:27] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2337.codfw.wmnet with reason: host reimage [20:22:34] (03CR) 10Jforrester: add abstract.wikipedia.org to section for wikis not covered by langlist (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1227706 (https://phabricator.wikimedia.org/T411724) (owner: 10Dzahn) [20:26:54] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2335.codfw.wmnet with reason: host reimage [20:28:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:33:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:36:10] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [20:37:16] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [20:37:17] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2336.codfw.wmnet with OS bookworm [20:37:25] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11530445 
(10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host wikikube-worker2336.codfw.wm... [20:40:00] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [20:42:03] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [20:42:05] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2337.codfw.wmnet with OS bookworm [20:42:12] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11530453 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host wikikube-worker2337.codfw.wm... [20:44:15] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, 06Infrastructure-Foundations: Eqiad: Fr-tech expansion - https://phabricator.wikimedia.org/T403035#11530455 (10RobH) Correction: >>! In T403035#11518969, @cmooney wrote: > > WMF Mgmt Network Links `E15`: > > |Device 1|Front Port|Logical Int|Device... 
[20:44:34] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [20:45:45] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [20:45:46] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2335.codfw.wmnet with OS bookworm [20:46:01] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11530456 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host wikikube-worker2335.codfw.wm... [20:46:53] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11530458 (10Jhancock.wm) [20:48:34] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2338.codfw.wmnet with OS bookworm [20:48:48] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11530463 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host wikikube-worker2338.codf... [20:48:51] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2339.codfw.wmnet with OS bookworm [20:49:00] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11530464 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host wikikube-worker2339.codf... 
[20:49:05] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2340.codfw.wmnet with OS bookworm [20:49:14] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11530465 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host wikikube-worker2340.codf... [20:50:44] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, 06Infrastructure-Foundations: Eqiad: Fr-tech expansion - https://phabricator.wikimedia.org/T403035#11530466 (10RobH) [20:56:34] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, 06Infrastructure-Foundations: Eqiad: Fr-tech expansion - https://phabricator.wikimedia.org/T403035#11530492 (10RobH) [20:59:09] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, 06Infrastructure-Foundations: Eqiad: Fr-tech expansion - https://phabricator.wikimedia.org/T403035#11530494 (10RobH) [20:59:52] !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2338.codfw.wmnet with reason: host reimage [21:00:24] !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2340.codfw.wmnet with reason: host reimage [21:00:26] !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2339.codfw.wmnet with reason: host reimage [21:03:00] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2338.codfw.wmnet with reason: host reimage [21:06:17] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2339.codfw.wmnet with reason: host reimage [21:12:34] 06SRE, 06Traffic, 13Patch-For-Review: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11530539 (10jeremyb) >>! 
In T81605#11530397, @cmooney wrote: > As a further test I wiped my cache, started a packet capture and did a dig for '//en.wikimedia.org//'. did you intend to use a non-c... [21:13:57] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2340.codfw.wmnet with reason: host reimage [21:18:11] 06SRE, 10LDAP-Access-Requests, 06WMF-NDA-Requests: Grant Access to NDA for Johannnes89 - https://phabricator.wikimedia.org/T414789#11530547 (10KFrancis) Hi @Johannnes89 To process your volunteer NDA, I'll need your home mailing address and your personal email address. Please send that information to kfranci... [21:20:19] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [21:20:37] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [21:20:39] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2338.codfw.wmnet with OS bookworm [21:20:51] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11530556 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host wikikube-worker2338.codfw.wm... 
[21:24:04] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [21:24:12] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [21:24:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1244 (T413525)', diff saved to https://phabricator.wikimedia.org/P87647 and previous config saved to /var/cache/conftool/dbconfig/20260116-212411-marostegui.json [21:24:16] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [21:24:57] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [21:24:58] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2339.codfw.wmnet with OS bookworm [21:25:09] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11530574 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host wikikube-worker2339.codfw.wm... 
[21:31:50] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [21:32:08] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [21:32:09] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2340.codfw.wmnet with OS bookworm [21:32:18] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11530587 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host wikikube-worker2340.codfw.wm... [21:33:14] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11530590 (10Jhancock.wm) [21:34:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1244', diff saved to https://phabricator.wikimedia.org/P87648 and previous config saved to /var/cache/conftool/dbconfig/20260116-213419-marostegui.json [21:35:04] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2341.codfw.wmnet with OS bookworm [21:35:18] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11530595 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host wikikube-worker2341.codf... 
[21:35:18] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2342.codfw.wmnet with OS bookworm [21:35:28] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2343.codfw.wmnet with OS bookworm [21:35:29] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11530596 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host wikikube-worker2342.codf... [21:35:36] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11530597 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host wikikube-worker2343.codf... [21:37:17] 10SRE-Access-Requests: Requesting `analytics-admins` access for AKhatun - https://phabricator.wikimedia.org/T414846 (10AKhatun_WMF) 03NEW [21:38:42] 06SRE, 10LDAP-Access-Requests, 06WMF-NDA-Requests: Grant Access to NDA for Johannnes89 - https://phabricator.wikimedia.org/T414789#11530616 (10Johannnes89) >>! In T414789#11530547, @KFrancis wrote: > Hi @Johannnes89 To process your volunteer NDA, I'll need your home mailing address and your personal email ad... [21:41:55] 06SRE, 06Traffic, 13Patch-For-Review: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11530621 (10BBlack) >>! In T81605#11530397, @cmooney wrote: > But it seems Bind does not cache the glue records / additional that comes back from the .org authdns. At least for any length of time... 
[21:44:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1244', diff saved to https://phabricator.wikimedia.org/P87649 and previous config saved to /var/cache/conftool/dbconfig/20260116-214427-marostegui.json [21:46:31] !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2341.codfw.wmnet with reason: host reimage [21:46:33] !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2342.codfw.wmnet with reason: host reimage [21:46:43] !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2343.codfw.wmnet with reason: host reimage [21:50:20] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2341.codfw.wmnet with reason: host reimage [21:54:01] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2342.codfw.wmnet with reason: host reimage [21:54:37] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1244 (T413525)', diff saved to https://phabricator.wikimedia.org/P87650 and previous config saved to /var/cache/conftool/dbconfig/20260116-215436-marostegui.json [21:54:41] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [21:54:52] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1245.eqiad.wmnet with reason: Maintenance [21:58:18] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2343.codfw.wmnet with reason: host reimage [22:08:52] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [22:09:22] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera 
data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [22:09:23] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2341.codfw.wmnet with OS bookworm [22:09:37] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11530672 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host wikikube-worker2341.codfw.wm... [22:13:06] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [22:15:25] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [22:15:27] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2342.codfw.wmnet with OS bookworm [22:15:33] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [22:15:41] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11530678 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host wikikube-worker2342.codfw.wm... 
[22:15:57] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [22:15:58] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2343.codfw.wmnet with OS bookworm [22:16:08] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11530682 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host wikikube-worker2343.codfw.wm... [22:20:44] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11530693 (10Jhancock.wm) [22:22:52] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2344.codfw.wmnet with OS bookworm [22:23:05] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11530697 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host wikikube-worker2344.codf... [22:23:10] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2347.codfw.wmnet with OS bookworm [22:23:21] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11530699 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host wikikube-worker2347.codf... 
[22:23:25] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker2348.codfw.wmnet with OS bookworm [22:23:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87651 and previous config saved to /var/cache/conftool/dbconfig/20260116-222326-marostegui.json [22:23:33] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11530700 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host wikikube-worker2348.codf... [22:23:35] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [22:23:35] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [22:33:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P87652 and previous config saved to /var/cache/conftool/dbconfig/20260116-223334-marostegui.json [22:34:04] !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2344.codfw.wmnet with reason: host reimage [22:34:06] !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2347.codfw.wmnet with reason: host reimage [22:34:51] !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2348.codfw.wmnet with reason: host reimage [22:38:50] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2344.codfw.wmnet with reason: host reimage [22:42:03] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2348.codfw.wmnet with reason: host reimage [22:42:58] (03PS1) 10Ryan Kemper: wdqs: fix typo for 
wdqs1026 [puppet] - 10https://gerrit.wikimedia.org/r/1227929 [22:43:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P87653 and previous config saved to /var/cache/conftool/dbconfig/20260116-224343-marostegui.json [22:44:17] (03CR) 10Bking: [C:03+2] wdqs: fix typo for wdqs1026 [puppet] - 10https://gerrit.wikimedia.org/r/1227929 (owner: 10Ryan Kemper) [22:46:52] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2347.codfw.wmnet with reason: host reimage [22:53:52] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87654 and previous config saved to /var/cache/conftool/dbconfig/20260116-225351-marostegui.json [22:53:59] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [22:53:59] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [22:54:08] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2153.codfw.wmnet with reason: Maintenance [22:54:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2153 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87655 and previous config saved to /var/cache/conftool/dbconfig/20260116-225416-marostegui.json [22:56:19] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [22:57:54] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [22:57:55] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
wikikube-worker2344.codfw.wmnet with OS bookworm [22:58:05] ops-codfw, SRE, DC-Ops, ServiceOps new, ServiceOps-Upgrades-Hardware: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11530759 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host wikikube-worker2344.codfw.wm... [22:58:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs1026:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [22:59:14] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [22:59:32] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [22:59:33] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2348.codfw.wmnet with OS bookworm [22:59:45] ops-codfw, SRE, DC-Ops, ServiceOps new, ServiceOps-Upgrades-Hardware: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11530762 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host wikikube-worker2348.codfw.wm...
[23:01:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2247 (T413525)', diff saved to https://phabricator.wikimedia.org/P87656 and previous config saved to /var/cache/conftool/dbconfig/20260116-230134-marostegui.json [23:01:40] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [23:04:09] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [23:04:29] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [23:04:30] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2347.codfw.wmnet with OS bookworm [23:04:40] ops-codfw, SRE, DC-Ops, ServiceOps new, ServiceOps-Upgrades-Hardware: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11530785 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host wikikube-worker2347.codfw.wm...
[23:06:25] ops-codfw, SRE, DC-Ops, ServiceOps new, ServiceOps-Upgrades-Hardware: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11530790 (Jhancock.wm) [23:11:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2247', diff saved to https://phabricator.wikimedia.org/P87657 and previous config saved to /var/cache/conftool/dbconfig/20260116-231143-marostegui.json [23:21:52] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2247', diff saved to https://phabricator.wikimedia.org/P87658 and previous config saved to /var/cache/conftool/dbconfig/20260116-232151-marostegui.json [23:31:02] (PS1) Dduvall: admin: remove old non-fido keys for dduvall [puppet] - https://gerrit.wikimedia.org/r/1227949 (https://phabricator.wikimedia.org/T414619) [23:32:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2247 (T413525)', diff saved to https://phabricator.wikimedia.org/P87659 and previous config saved to /var/cache/conftool/dbconfig/20260116-233200-marostegui.json [23:32:06] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [23:32:49] SRE, SRE-Access-Requests, Patch-For-Review: Yubikey-SSH-FIDO access for dduvall - https://phabricator.wikimedia.org/T414619#11530833 (dduvall) I've verified production access using my new FIDO SSH keys and submitted a patch to remove my old key.
[23:33:01] SRE, SRE-Access-Requests, Patch-For-Review: Yubikey-SSH-FIDO access for dduvall - https://phabricator.wikimedia.org/T414619#11530836 (dduvall) [23:34:12] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:39:50] SRE, Traffic, Patch-For-Review: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11530844 (cmooney) >>! In T81605#11530539, @jeremyb wrote: > did you intend to use a non-canonical domain here? pedia vs media. Ah sorry that was a typo, corrected now. I looked up //en.wikipe...