[00:40:17] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1228299
[00:40:17] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1228299 (owner: TrainBranchBot)
[00:54:17] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1228299 (owner: TrainBranchBot)
[00:59:06] (PS1) Ladsgroup: cassandra: Drop departed staff db [puppet] - https://gerrit.wikimedia.org/r/1228300
[01:00:58] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[01:10:18] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1228301
[01:10:18] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1228301 (owner: TrainBranchBot)
[01:14:26] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 13m 28s)
[01:24:12] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[01:35:37] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1228301 (owner: TrainBranchBot)
[02:07:54] (CR) Scott French: "Thank you very much for moving this forward!"
[deployment-charts] - https://gerrit.wikimedia.org/r/1227799 (https://phabricator.wikimedia.org/T410296) (owner: Jgiannelos)
[02:14:55] (CR) Scott French: [C:+1] docker_registry: allor to set the loglevel for an instance [puppet] - https://gerrit.wikimedia.org/r/1227705 (https://phabricator.wikimedia.org/T394476) (owner: Elukey)
[02:39:17] PROBLEM - SSH on bast4005 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring
[02:41:17] RECOVERY - SSH on bast4005 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[03:07:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87725 and previous config saved to /var/cache/conftool/dbconfig/20260119-030716-marostegui.json
[03:07:23] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[03:07:23] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[03:17:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P87726 and previous config saved to /var/cache/conftool/dbconfig/20260119-031725-marostegui.json
[03:19:12] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[03:27:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P87727 and previous config saved to /var/cache/conftool/dbconfig/20260119-032733-marostegui.json
[03:34:12] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw -
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:37:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87728 and previous config saved to /var/cache/conftool/dbconfig/20260119-033742-marostegui.json
[03:37:48] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[03:37:49] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[03:37:58] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1206.eqiad.wmnet with reason: Maintenance
[03:38:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1206 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87729 and previous config saved to /var/cache/conftool/dbconfig/20260119-033806-marostegui.json
[05:09:12] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:24:12] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[05:30:08] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets -
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:34:12] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:15:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87730 and previous config saved to /var/cache/conftool/dbconfig/20260119-061506-marostegui.json
[06:15:13] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[06:15:13] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[06:24:55] (CR) Pppery: "Your next step (after responding to my comment below) is to schedule this for a backport window - see https://wikitech.wikimedia.org/wiki/" [mediawiki-config] - https://gerrit.wikimedia.org/r/1226024 (https://phabricator.wikimedia.org/T414403) (owner: Shivaansh Singh)
[06:24:59] (CR) Pppery: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/1226024 (https://phabricator.wikimedia.org/T414403) (owner: Shivaansh Singh)
[06:25:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P87731 and previous config saved to /var/cache/conftool/dbconfig/20260119-062514-marostegui.json
[06:28:05] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1262.eqiad.wmnet with reason: Maintenance
[06:28:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1262 (T413525)', diff saved to https://phabricator.wikimedia.org/P87732 and previous config saved to /var/cache/conftool/dbconfig/20260119-062813-marostegui.json
[06:28:18] T413525: Add il_target_id to imagelinks table
in wmf production - https://phabricator.wikimedia.org/T413525
[06:29:18] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2248.codfw.wmnet with reason: Maintenance
[06:29:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2248 (T413525)', diff saved to https://phabricator.wikimedia.org/P87733 and previous config saved to /var/cache/conftool/dbconfig/20260119-062926-marostegui.json
[06:35:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P87734 and previous config saved to /var/cache/conftool/dbconfig/20260119-063522-marostegui.json
[06:45:31] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87735 and previous config saved to /var/cache/conftool/dbconfig/20260119-064531-marostegui.json
[06:45:37] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[06:45:38] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[06:45:48] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2176.codfw.wmnet with reason: Maintenance
[06:45:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2176 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87736 and previous config saved to /var/cache/conftool/dbconfig/20260119-064555-marostegui.json
[07:04:35] (CR) Ayounsi: [C:+1] "LGTM."
[puppet] - https://gerrit.wikimedia.org/r/1226928 (https://phabricator.wikimedia.org/T81605) (owner: Ssingh)
[07:19:12] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[07:33:16] (CR) Bartosz Wójtowicz: [C:+2] ml-services: Lower resource usage for article-descriptions on staging. [deployment-charts] - https://gerrit.wikimedia.org/r/1227736 (https://phabricator.wikimedia.org/T414431) (owner: Bartosz Wójtowicz)
[07:35:14] (Merged) jenkins-bot: ml-services: Lower resource usage for article-descriptions on staging. [deployment-charts] - https://gerrit.wikimedia.org/r/1227736 (https://phabricator.wikimedia.org/T414431) (owner: Bartosz Wójtowicz)
[07:44:48] !log bwojtowicz@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' .
[07:51:01] (CR) Muehlenhoff: [C:+2] Remove puppetmaster spec files [puppet] - https://gerrit.wikimedia.org/r/1227698 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[07:52:08] (CR) Muehlenhoff: [C:+2] Remove profile::admin::groups from old mediawiki roles [puppet] - https://gerrit.wikimedia.org/r/1227744 (owner: Muehlenhoff)
[08:00:04] Amir1, Urbanecm, and awight: #bothumor I � Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260119T0800).
[08:00:04] Seawolf35: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:22] here
[08:03:45] (CR) Muehlenhoff: [C:+2] Remove mwdebuggers group [puppet] - https://gerrit.wikimedia.org/r/1227745 (owner: Muehlenhoff)
[08:11:23] (CR) Brouberol: Define the airflow-sre public and internal domains (2 comments) [dns] - https://gerrit.wikimedia.org/r/1227731 (https://phabricator.wikimedia.org/T402512) (owner: Brouberol)
[08:11:29] (CR) Brouberol: [C:+2] Define the airflow-sre public and internal domains [dns] - https://gerrit.wikimedia.org/r/1227731 (https://phabricator.wikimedia.org/T402512) (owner: Brouberol)
[08:11:33] (CR) Muehlenhoff: [C:+2] admin: remove old non-fido keys for dduvall [puppet] - https://gerrit.wikimedia.org/r/1227949 (https://phabricator.wikimedia.org/T414619) (owner: Dduvall)
[08:11:46] !log brouberol@dns1004 START - running authdns-update
[08:13:32] (CR) Brouberol: [C:+2] "Oops, sorry, I had committed the suggested change, forgot to `git review` and then +2ed this one. I'll submit the changes in a separate pa" [dns] - https://gerrit.wikimedia.org/r/1227731 (https://phabricator.wikimedia.org/T402512) (owner: Brouberol)
[08:14:15] (PS1) Brouberol: Fix formatting of airflow-sre domain declarations [dns] - https://gerrit.wikimedia.org/r/1228315 (https://phabricator.wikimedia.org/T402512)
[08:14:50] SRE, SRE-Access-Requests, Patch-For-Review: Yubikey-SSH-FIDO access for dduvall - https://phabricator.wikimedia.org/T414619#11532399 (MoritzMuehlenhoff)
[08:15:03] SRE, SRE-Access-Requests, Patch-For-Review: Yubikey-SSH-FIDO access for dduvall - https://phabricator.wikimedia.org/T414619#11532400 (MoritzMuehlenhoff) Open→Resolved All done :-)
[08:15:10] (CR) Brouberol: [C:+2] Fix formatting of airflow-sre domain declarations [dns] - https://gerrit.wikimedia.org/r/1228315 (https://phabricator.wikimedia.org/T402512) (owner: Brouberol)
[08:15:18] !log brouberol@dns1004 START - running authdns-update
[08:16:24] !log brouberol@dns1004 END -
running authdns-update
[08:16:37] (CR) Elukey: [C:+2] docker_registry: allor to set the loglevel for an instance [puppet] - https://gerrit.wikimedia.org/r/1227705 (https://phabricator.wikimedia.org/T394476) (owner: Elukey)
[08:19:39] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[08:20:39] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[08:21:22] (CR) Elukey: [C:+2] role::cephadm::rgw: enable access logs for envoy [puppet] - https://gerrit.wikimedia.org/r/1227759 (https://phabricator.wikimedia.org/T394476) (owner: Elukey)
[08:27:44] (CR) Elukey: [C:+2] "To keep archives happy - I just realized that this patch enable "only" HTTP 500+ logs:" [puppet] - https://gerrit.wikimedia.org/r/1227759 (https://phabricator.wikimedia.org/T394476) (owner: Elukey)
[08:34:30] (PS1) Brouberol: dse-k8s-eqiad: add the airflow-sre to the ceph/PG operator tenant ns [deployment-charts] - https://gerrit.wikimedia.org/r/1228420 (https://phabricator.wikimedia.org/T402512)
[08:34:32] (PS1) Brouberol: dse-k8s-eqiad: define the postgresql-airflow-sre service [deployment-charts] - https://gerrit.wikimedia.org/r/1228421 (https://phabricator.wikimedia.org/T402512)
[08:34:34] (PS1) Brouberol: dse-k8s-eqiad: define the airflow-sre service [deployment-charts] - https://gerrit.wikimedia.org/r/1228422 (https://phabricator.wikimedia.org/T402512)
[08:35:24] (CR) Elukey: [C:+1] "Limited knowledge but LGTM!" [deployment-charts] - https://gerrit.wikimedia.org/r/1228420 (https://phabricator.wikimedia.org/T402512) (owner: Brouberol)
[08:36:14] (CR) Elukey: [C:+1] "Limited knowledge but LGTM!"
[deployment-charts] - https://gerrit.wikimedia.org/r/1228421 (https://phabricator.wikimedia.org/T402512) (owner: Brouberol)
[08:36:36] (CR) Muehlenhoff: [C:+2] Rename enc_client and move under puppetserver [puppet] - https://gerrit.wikimedia.org/r/1227618 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[08:36:52] (CR) Elukey: [C:+1] "Limited knowledge but LGTM!" [deployment-charts] - https://gerrit.wikimedia.org/r/1228422 (https://phabricator.wikimedia.org/T402512) (owner: Brouberol)
[08:40:26] (CR) Jcrespo: [C:+2] backup: Setup ms-backup[12]00[34] [puppet] - https://gerrit.wikimedia.org/r/1227789 (https://phabricator.wikimedia.org/T414717) (owner: Jcrespo)
[08:42:03] (CR) Muehlenhoff: [C:+2] Move validatecloudvpsfqdn.py out of the puppetmaster module [puppet] - https://gerrit.wikimedia.org/r/1227694 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[08:43:57] FIRING: CertAlmostExpired: Certificate for service opensearch-ipoid:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#opensearch-ipoid:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[08:44:46] (PS3) Jcrespo: backup: Set up backup1015-backup1020 & backup2015-backup2020 [puppet] - https://gerrit.wikimedia.org/r/1227792 (https://phabricator.wikimedia.org/T414728)
[08:45:54] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[08:46:37] !log dpogorzelski@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[08:47:23] !log continue asw1-b12-drmrs troubleshooting - T413181
[08:47:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:27] T413181: asw1-b12-drmrs stopped reporting metrics - https://phabricator.wikimedia.org/T413181
[08:49:02] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[08:49:40] (CR) Jcrespo: [C:+2] backup: Set up backup1015-backup1020 & backup2015-backup2020 [puppet] - https://gerrit.wikimedia.org/r/1227792 (https://phabricator.wikimedia.org/T414728) (owner: Jcrespo)
[08:50:44] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[08:51:58] (PS6) Kosta Harlan: (WIP) IPReputation: Enable OpenSearch IPoid provider on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1223636 (https://phabricator.wikimedia.org/T410615)
[08:53:17] (CR) Dpogorzelski: [C:+2] ml-build: add missing configs [puppet] - https://gerrit.wikimedia.org/r/1227743 (owner: Dpogorzelski)
[08:57:19] (CR) Vgutierrez: trafficserver: Send /ins-502b/v2/events to intake-analytics (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1218817 (https://phabricator.wikimedia.org/T412863) (owner: Milimetric)
[08:59:13] (CR) Muehlenhoff: [C:+2] Remove remaining traces of profile::puppet::agent::force_puppet7 [puppet] - https://gerrit.wikimedia.org/r/1227616 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[09:08:11] (CR) Brouberol: [C:+2] dse-k8s-eqiad: add the airflow-sre to the ceph/PG operator tenant ns [deployment-charts] - https://gerrit.wikimedia.org/r/1228420 (https://phabricator.wikimedia.org/T402512) (owner: Brouberol)
[09:14:18] SRE-swift-storage, Ceph, Data-Persistence, serviceops: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11532568 (elukey) After adding the envoy access logs (they do log only HTTP 500+ requests though): ` [2026-01-19T09:04:41.009Z] "PUT /registry-restricted/docker/...
[09:15:56] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[09:16:40] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[09:16:57] SRE-swift-storage, Data-Persistence, MediaViewer, Thumbor, Traffic: FY 25/26 WE 5.4.7 Standardize thumbnail sizes - https://phabricator.wikimedia.org/T408062#11532571 (MatthewVernon) There is probably further documentation improvement; the former I've updated with the new set of standard...
[09:18:57] FIRING: [2x] CertAlmostExpired: Certificate for service opensearch-ipoid:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#opensearch-ipoid:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[09:19:45] SRE, Infrastructure-Foundations, netops, Data-Platform-SRE (2026.01.05 - 2026.01.23), Essential-Work: Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11532578 (JAllemandou) Unfortunately the problem is not solved as shown in [[ https://grafana.wikime...
[09:21:56] (CR) Muehlenhoff: [C:+2] conf/eqiad: Remove obsolete cert [puppet] - https://gerrit.wikimedia.org/r/1182694 (https://phabricator.wikimedia.org/T352245) (owner: Muehlenhoff)
[09:22:21] (PS1) Filippo Giunchedi: admin: remove non-FIDO ssh key for filippo [puppet] - https://gerrit.wikimedia.org/r/1228424
[09:22:27] ops-eqiad, SRE, Data-Persistence, DC-Ops, Patch-For-Review: Q3:rack/setup/install ms-backup100[34] - https://phabricator.wikimedia.org/T414718#11532598 (jcrespo)
[09:22:43] ops-eqiad, SRE, Data-Persistence, DC-Ops, Patch-For-Review: Q3:rack/setup/install ms-backup100[34] - https://phabricator.wikimedia.org/T414718#11532599 (jcrespo) a:jcrespo→None
[09:22:53] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1228424 (owner: Filippo Giunchedi)
[09:23:20] (CR) Filippo Giunchedi: [C:+2] admin: remove non-FIDO ssh key for filippo [puppet] - https://gerrit.wikimedia.org/r/1228424 (owner: Filippo Giunchedi)
[09:23:48] ops-codfw, SRE, Data-Persistence, DC-Ops, Patch-For-Review:
Q3:rack/setup/install ms-backup200[34] - https://phabricator.wikimedia.org/T414717#11532603 (jcrespo)
[09:23:55] ops-codfw, SRE, Data-Persistence, DC-Ops, Patch-For-Review: Q3:rack/setup/install ms-backup200[34] - https://phabricator.wikimedia.org/T414717#11532604 (jcrespo) a:jcrespo→None
[09:24:05] moritzm: merged your change too
[09:24:12] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[09:25:15] ops-codfw, SRE, Data-Persistence, DC-Ops, Patch-For-Review: Q3:rack/setup/install backup20[16-20] - https://phabricator.wikimedia.org/T414727#11532606 (jcrespo)
[09:26:10] ops-eqiad, SRE, Data-Persistence, DC-Ops, Patch-For-Review: Q3:rack/setup/install backup10[16-20] - https://phabricator.wikimedia.org/T414728#11532607 (jcrespo)
[09:30:49] (CR) Brouberol: [C:+2] dse-k8s-eqiad: define the postgresql-airflow-sre service [deployment-charts] - https://gerrit.wikimedia.org/r/1228421 (https://phabricator.wikimedia.org/T402512) (owner: Brouberol)
[09:31:05] (PS16) Vgutierrez: cache::upload: introduce rate-limits by traffic class [puppet] - https://gerrit.wikimedia.org/r/1203297 (https://phabricator.wikimedia.org/T406555) (owner: Giuseppe Lavagetto)
[09:31:20] godog: ack, thx
[09:32:38] SRE, Infrastructure-Foundations, netops: Offline script - adjust to work with fundraising - https://phabricator.wikimedia.org/T414321#11532621 (ayounsi) a:cmooney→Jclark-ctr @Jclark-ctr we had a look at the decom cookbook and offline script without seeing any smoking gun on why it would misbe...
[09:32:48] (Merged) jenkins-bot: dse-k8s-eqiad: define the postgresql-airflow-sre service [deployment-charts] - https://gerrit.wikimedia.org/r/1228421 (https://phabricator.wikimedia.org/T402512) (owner: Brouberol)
[09:33:07] (PS2) Brouberol: dse-k8s-eqiad: define the airflow-sre service [deployment-charts] - https://gerrit.wikimedia.org/r/1228422 (https://phabricator.wikimedia.org/T402512)
[09:33:09] (CR) Vgutierrez: [C:+2] cache::upload: introduce rate-limits by traffic class [puppet] - https://gerrit.wikimedia.org/r/1203297 (https://phabricator.wikimedia.org/T406555) (owner: Giuseppe Lavagetto)
[09:34:12] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:35:40] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-airflow-sre: apply
[09:35:45] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-airflow-sre: apply
[09:35:57] (CR) Vgutierrez: [C:+2] cache::upload: introduce rate-limits by traffic class (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1203297 (https://phabricator.wikimedia.org/T406555) (owner: Giuseppe Lavagetto)
[09:38:18] (PS1) Kevin Bazira: ml-services: rr-wikidata horizontal scaling [deployment-charts] - https://gerrit.wikimedia.org/r/1228426 (https://phabricator.wikimedia.org/T414060)
[09:40:49] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2248 (T413525)', diff saved to https://phabricator.wikimedia.org/P87737 and previous config saved to /var/cache/conftool/dbconfig/20260119-094048-marostegui.json
[09:40:53] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[09:47:28] SRE,
SRE-Access-Requests, Data-Platform-SRE (2026.01.05 - 2026.01.23): Requesting `analytics-admins` access for AKhatun - https://phabricator.wikimedia.org/T414846#11532801 (BTullis) a:BTullis
[09:50:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2248', diff saved to https://phabricator.wikimedia.org/P87738 and previous config saved to /var/cache/conftool/dbconfig/20260119-095055-marostegui.json
[09:51:28] (PS1) Btullis: Add akhatun to the analytics-admin group [puppet] - https://gerrit.wikimedia.org/r/1228429 (https://phabricator.wikimedia.org/T414846)
[09:54:34] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for johannesrichterwmde - https://phabricator.wikimedia.org/T414678#11532920 (FCeratto-WMF) Open→Resolved a:FCeratto-WMF Thanks, closing task.
[09:54:55] SRE, SRE-Access-Requests, Data-Platform-SRE (2026.01.05 - 2026.01.23), Patch-For-Review: Requesting `analytics-admins` access for AKhatun - https://phabricator.wikimedia.org/T414846#11532925 (BTullis)
[09:55:27] (CR) Btullis: [C:-1] "Awaiting manager approval on the ticket." [puppet] - https://gerrit.wikimedia.org/r/1228429 (https://phabricator.wikimedia.org/T414846) (owner: Btullis)
[09:55:48] (CR) Brouberol: [C:+1] "LGTM (I'm no manager)" [puppet] - https://gerrit.wikimedia.org/r/1228429 (https://phabricator.wikimedia.org/T414846) (owner: Btullis)
[09:56:01] SRE-swift-storage, Ceph, Data-Persistence, serviceops: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11532932 (elukey) I think this is probably related to some weird state the the bucket is in: ` elukey@stat1010:~$ s3cmd del s3://registry-restricted/docker/regis...
[09:56:37] PROBLEM - Confd vcl based reload on cp7009 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes.
https://wikitech.wikimedia.org/wiki/Varnish
[09:57:33] PROBLEM - Confd vcl based reload on cp7015 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[09:57:49] SRE, SRE-Access-Requests, Data-Platform-SRE (2026.01.05 - 2026.01.23), Patch-For-Review: Requesting `analytics-admins` access for AKhatun - https://phabricator.wikimedia.org/T414846#11532966 (BTullis) @Ahoelzl - We just require your approval before continuing. Thanks.
[10:00:03] (PS3) Brouberol: dse-k8s-eqiad: define the airflow-sre service [deployment-charts] - https://gerrit.wikimedia.org/r/1228422 (https://phabricator.wikimedia.org/T402512)
[10:01:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2248', diff saved to https://phabricator.wikimedia.org/P87739 and previous config saved to /var/cache/conftool/dbconfig/20260119-100103-marostegui.json
[10:01:56] SRE-swift-storage, Ceph, Data-Persistence, serviceops: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11532993 (elukey) Finally something that makes sense - on stat1010 I tried to upload a super small fine (a txt file with a date) and this is the result: ` elukey...
[10:04:33] PROBLEM - Confd vcl based reload on cp7013 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes.
https://wikitech.wikimedia.org/wiki/Varnish
[10:08:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1262 (T413525)', diff saved to https://phabricator.wikimedia.org/P87740 and previous config saved to /var/cache/conftool/dbconfig/20260119-100852-marostegui.json
[10:08:57] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[10:11:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2248 (T413525)', diff saved to https://phabricator.wikimedia.org/P87741 and previous config saved to /var/cache/conftool/dbconfig/20260119-101111-marostegui.json
[10:11:29] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2240.codfw.wmnet with reason: Maintenance
[10:11:37] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2240 (T413525)', diff saved to https://phabricator.wikimedia.org/P87742 and previous config saved to /var/cache/conftool/dbconfig/20260119-101136-marostegui.json
[10:12:43] !log ayounsi@cumin1003 START - Cookbook sre.hosts.reboot-single for host sretest2003.codfw.wmnet
[10:14:25] ops-eqiad, DC-Ops: Inbound errors on interface lswtest-d8-eqiad:mgmt0 () - https://phabricator.wikimedia.org/T414939 (phaultfinder) NEW
[10:17:38] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 19 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1219541 (https://phabricator.wikimedia.org/T411479) (owner: Sergio Gimeno)
[10:19:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1262', diff saved to https://phabricator.wikimedia.org/P87743 and previous config saved to /var/cache/conftool/dbconfig/20260119-101901-marostegui.json
[10:22:21] (CR) Anzx: [C:-1] "above configuration changes is subjected only to commons wiki, so bot usergroup
on enwikiquote doesn't have `editautopatrolprotected`, in " [mediawiki-config] - https://gerrit.wikimedia.org/r/1227493 (https://phabricator.wikimedia.org/T414711) (owner: Seawolf35gerrit)
[10:24:06] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest2003.codfw.wmnet
[10:27:11] (CR) Brouberol: [C:+2] dse-k8s-eqiad: define the airflow-sre service [deployment-charts] - https://gerrit.wikimedia.org/r/1228422 (https://phabricator.wikimedia.org/T402512) (owner: Brouberol)
[10:29:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1262', diff saved to https://phabricator.wikimedia.org/P87744 and previous config saved to /var/cache/conftool/dbconfig/20260119-102909-marostegui.json
[10:29:21] !log restart apus rgws in eqiad
[10:29:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:30:36] (CR) Federico Ceratto: [C:+2] admin: remove ryankemper's old SSH key [puppet] - https://gerrit.wikimedia.org/r/1227848 (https://phabricator.wikimedia.org/T412126) (owner: Federico Ceratto)
[10:31:56] (CR) Anzx: [C:+1] "my bad done in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/+/0c2fb70b914d8bbcb72c4af16d111e6e35682eb8%5E%21" [mediawiki-config] - https://gerrit.wikimedia.org/r/1227493 (https://phabricator.wikimedia.org/T414711) (owner: Seawolf35gerrit)
[10:32:45] SRE, SRE-Access-Requests, Patch-For-Review: Yubikey-SSH-FIDO for ryankemper - https://phabricator.wikimedia.org/T412126#11533119 (FCeratto-WMF)
[10:32:52] SRE, SRE-Access-Requests, Patch-For-Review: Yubikey-SSH-FIDO for ryankemper - https://phabricator.wikimedia.org/T412126#11533120 (FCeratto-WMF) Stalled→Resolved
[10:33:35] SRE, ServiceOps new: requestctl support to enable/disable ipblocks - https://phabricator.wikimedia.org/T404591#11533122 (JMeybohm) Open→Resolved
[10:34:40] (CR) Anzx: [C:+1] "bot already have
editautopatrolprotected ref: 0c2fb70" [mediawiki-config] - https://gerrit.wikimedia.org/r/1227493 (https://phabricator.wikimedia.org/T414711) (owner: Seawolf35gerrit)
[10:35:24] SRE, ServiceOps new, Epic: FY 25/26 WE 5.4.2: Known bots / clients - https://phabricator.wikimedia.org/T400100#11533127 (JMeybohm)
[10:37:25] (CR) Muehlenhoff: [C:+2] Copy yamllint into the puppetserver module and use it [puppet] - https://gerrit.wikimedia.org/r/1227702 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[10:37:47] PROBLEM - Confd vcl based reload on cp4045 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[10:38:20] SRE, SRE-swift-storage, Ceph, Data-Persistence, serviceops: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11533132 (elukey)
[10:39:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1262 (T413525)', diff saved to https://phabricator.wikimedia.org/P87745 and previous config saved to /var/cache/conftool/dbconfig/20260119-103917-marostegui.json
[10:39:19] PROBLEM - Confd vcl based reload on cp4046 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[10:39:22] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525
[10:39:42] fabfur: ^^ is that you?
[10:40:50] SRE, SRE-Access-Requests: Requesting access to L3 data access for kimpham (developer name Kim.pham) - https://phabricator.wikimedia.org/T414660#11533155 (FCeratto-WMF) Open→In progress
[10:41:02] (PS1) Brouberol: dse-k8s-eqiad/airflow-sre: define deploy role and TLS SAN [deployment-charts] - https://gerrit.wikimedia.org/r/1228437 (https://phabricator.wikimedia.org/T402512)
[10:41:19] PROBLEM - Confd vcl based reload on cp4048 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes.
https://wikitech.wikimedia.org/wiki/Varnish [10:41:48] FIRING: PuppetFailure: Puppet has failed on ml-build1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [10:41:59] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11533161 (10ayounsi) > Just to understand this point - do you mean that their firmware doesn't expose them because it is old etc.. or because they are supermic... [10:42:56] 10ops-eqiad, 06Data-Platform-SRE, 06DC-Ops: Network is hard down on an-worker1160.eqiad.wmnet - https://phabricator.wikimedia.org/T414942 (10brouberol) 03NEW [10:43:05] 10ops-eqiad, 06Data-Platform-SRE, 06DC-Ops: Network is hard down on an-worker1160.eqiad.wmnet - https://phabricator.wikimedia.org/T414942#11533172 (10brouberol) p:05Triage→03High [10:43:19] PROBLEM - Confd vcl based reload on cp4049 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [10:43:48] (03CR) 10Joal: [C:04-1] dse-k8s-eqiad/airflow-sre: define deploy role and TLS SAN [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228437 (https://phabricator.wikimedia.org/T402512) (owner: 10Brouberol) [10:44:19] PROBLEM - Confd vcl based reload on cp4050 is CRITICAL: reload-vcl failed to run since 0h, 2 minutes. https://wikitech.wikimedia.org/wiki/Varnish [10:44:51] 06SRE, 10SRE-Access-Requests: Requesting access to L3 data access for kimpham (developer name Kim.pham) - https://phabricator.wikimedia.org/T414660#11533174 (10FCeratto-WMF) Hello @Milimetric @Ahoelzl @Ottomata - can you please review this access request for `analytics-privatedata-users`? 
Thanks [10:47:02] (03CR) 10Joal: [C:03+1] dse-k8s-eqiad/airflow-sre: define deploy role and TLS SAN [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228437 (https://phabricator.wikimedia.org/T402512) (owner: 10Brouberol) [10:47:12] (03PS1) 10Majavah: P:openstack: enc_client: Fix file path [puppet] - 10https://gerrit.wikimedia.org/r/1228440 [10:48:50] (03CR) 10Brouberol: [C:03+2] dse-k8s-eqiad/airflow-sre: define deploy role and TLS SAN [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228437 (https://phabricator.wikimedia.org/T402512) (owner: 10Brouberol) [10:49:03] (03CR) 10CI reject: [V:04-1] P:openstack: enc_client: Fix file path [puppet] - 10https://gerrit.wikimedia.org/r/1228440 (owner: 10Majavah) [10:49:47] (03PS2) 10Majavah: P:openstack: enc_client: Fix file path [puppet] - 10https://gerrit.wikimedia.org/r/1228440 [10:51:13] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [10:52:16] (03CR) 10Filippo Giunchedi: [C:03+1] P:openstack: enc_client: Fix file path [puppet] - 10https://gerrit.wikimedia.org/r/1228440 (owner: 10Majavah) [10:52:55] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [10:54:49] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [10:55:53] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [10:56:03] (03PS16) 10Daniel Kinzler: charts: add redioscope chart and service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207256 (https://phabricator.wikimedia.org/T407999) [10:56:56] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-sre: apply [10:57:46] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1228440 (owner: 10Majavah) [10:57:47] RECOVERY - Confd vcl based reload on cp4045 is OK: reload-vcl successfully ran 0h, 0 minutes ago. 
https://wikitech.wikimedia.org/wiki/Varnish [10:58:02] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-sre: apply [10:58:17] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - ml-staging-ctrl_6443: Servers ml-staging-ctrl2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:59:17] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:59:28] (03PS3) 10Brouberol: trafficserver: setup caching and ATS rules to publicly expose airflow-sre.w.o [puppet] - 10https://gerrit.wikimedia.org/r/1227733 (https://phabricator.wikimedia.org/T402512) [10:59:33] (03CR) 10Brouberol: "`" [puppet] - 10https://gerrit.wikimedia.org/r/1227733 (https://phabricator.wikimedia.org/T402512) (owner: 10Brouberol) [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260119T1100) [11:00:19] RECOVERY - Confd vcl based reload on cp4050 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:00:50] (03CR) 10Brouberol: [C:03+2] trafficserver: setup caching and ATS rules to publicly expose airflow-sre.w.o [puppet] - 10https://gerrit.wikimedia.org/r/1227733 (https://phabricator.wikimedia.org/T402512) (owner: 10Brouberol) [11:01:17] (03PS1) 10Joal: Grow walStorage on dse-k8s pg_airflow_search [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228445 (https://phabricator.wikimedia.org/T411992) [11:01:17] RECOVERY - Confd vcl based reload on cp4049 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:01:19] RECOVERY - Confd vcl based reload on cp4048 is OK: reload-vcl successfully ran 0h, 0 minutes ago. 
https://wikitech.wikimedia.org/wiki/Varnish [11:02:17] RECOVERY - Confd vcl based reload on cp4046 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:02:33] RECOVERY - Confd vcl based reload on cp7015 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:02:33] RECOVERY - Confd vcl based reload on cp7013 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [11:02:35] 06SRE, 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06serviceops: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11533214 (10MatthewVernon) After the deletion of objects from registry-restricted (from both eqiad and codfw) late last week, we were stuck with sync being... [11:02:52] (03CR) 10Brouberol: Grow walStorage on dse-k8s pg_airflow_search (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228445 (https://phabricator.wikimedia.org/T411992) (owner: 10Joal) [11:03:37] RECOVERY - Confd vcl based reload on cp7009 is OK: reload-vcl successfully ran 0h, 0 minutes ago. 
https://wikitech.wikimedia.org/wiki/Varnish [11:04:18] 06SRE, 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06serviceops: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11533218 (10MatthewVernon) [if that report is wrong, then probably a full re-sync is required :-/ ] [11:04:47] (03PS1) 10Brouberol: Provision dummy oidc client secret for airflow_sre [labs/private] - 10https://gerrit.wikimedia.org/r/1228446 [11:05:08] (03CR) 10Brouberol: [C:03+2] Provision dummy oidc client secret for airflow_sre [labs/private] - 10https://gerrit.wikimedia.org/r/1228446 (owner: 10Brouberol) [11:05:10] (03CR) 10Brouberol: [V:03+2 C:03+2] Provision dummy oidc client secret for airflow_sre [labs/private] - 10https://gerrit.wikimedia.org/r/1228446 (owner: 10Brouberol) [11:07:17] (03PS1) 10Brouberol: Provision the OIDC config for airflow-sre [puppet] - 10https://gerrit.wikimedia.org/r/1228448 (https://phabricator.wikimedia.org/T402512) [11:12:22] (03PS2) 10Joal: Grow storage on dse-k8s pg_airflow_search [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228445 (https://phabricator.wikimedia.org/T411992) [11:12:48] (03CR) 10Brouberol: [C:03+1] Grow storage on dse-k8s pg_airflow_search (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228445 (https://phabricator.wikimedia.org/T411992) (owner: 10Joal) [11:13:00] (03CR) 10Joal: "New patch sent" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228445 (https://phabricator.wikimedia.org/T411992) (owner: 10Joal) [11:13:08] (03CR) 10Btullis: [C:03+1] Provision the OIDC config for airflow-sre [puppet] - 10https://gerrit.wikimedia.org/r/1228448 (https://phabricator.wikimedia.org/T402512) (owner: 10Brouberol) [11:15:53] (03PS1) 10Muehlenhoff: Remove ferm/rsync/tcpircbot settings [puppet] - 10https://gerrit.wikimedia.org/r/1228454 (https://phabricator.wikimedia.org/T397017) [11:17:32] (03CR) 10Brouberol: [C:03+2] Provision the OIDC config for airflow-sre [puppet] - 
10https://gerrit.wikimedia.org/r/1228448 (https://phabricator.wikimedia.org/T402512) (owner: 10Brouberol) [11:19:12] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [11:20:15] (03CR) 10Majavah: [C:03+2] P:openstack: enc_client: Fix file path [puppet] - 10https://gerrit.wikimedia.org/r/1228440 (owner: 10Majavah) [11:21:05] (03CR) 10Brouberol: [C:03+2] Grow storage on dse-k8s pg_airflow_search [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228445 (https://phabricator.wikimedia.org/T411992) (owner: 10Joal) [11:26:02] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1187.eqiad.wmnet [11:26:27] 07sre-alert-triage, 06Data-Platform-SRE (2026.01.05 - 2026.01.23), 07Essential-Work: Alert in need of triage: Dell PowerEdge or Supermicro Broadcom RAID Controller (instance an-worker1187) - https://phabricator.wikimedia.org/T405217#11533268 (10ops-monitoring-bot) Host an-worker1187.eqiad.wmnet rebooted by b... 
[11:27:13] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:28:10] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 42 hosts with reason: Primary switchover s4 T414542 [11:28:14] T414542: Switchover s4 master (db1160 -> db1244) - https://phabricator.wikimedia.org/T414542 [11:28:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set db1244 with weight 0 T414542', diff saved to https://phabricator.wikimedia.org/P87746 and previous config saved to /var/cache/conftool/dbconfig/20260119-112825-marostegui.json [11:29:18] !log vgutierrez@cumin1003 START - Cookbook sre.hosts.remove-downtime for 111 hosts [11:29:40] (03CR) 10Elukey: sre.hosts.provision: (Dell) disable LLDP on Broadcom NICs (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1207804 (https://phabricator.wikimedia.org/T250367) (owner: 10Ayounsi) [11:30:23] !log vgutierrez@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 111 hosts [11:31:43] !log vgutierrez@cumin1003 START - Cookbook sre.hosts.remove-downtime for cp7004.magru.wmnet [11:31:43] !log vgutierrez@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp7004.magru.wmnet [11:32:03] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:33:30] (03PS2) 10Gerrit maintenance bot: mariadb: Promote db1244 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/1226509 (https://phabricator.wikimedia.org/T414542) [11:33:35] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1187.eqiad.wmnet [11:34:04] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-sre: apply [11:34:11] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db1244 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/1226509 (https://phabricator.wikimedia.org/T414542) (owner: 10Gerrit maintenance bot) [11:34:45] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-sre: apply [11:34:51] !log Starting s4 eqiad failover from db1160 to db1244 - T414542 [11:34:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:56] T414542: Switchover s4 master (db1160 -> db1244) - https://phabricator.wikimedia.org/T414542 [11:35:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote db1244 to s4 primary T414542', diff saved to https://phabricator.wikimedia.org/P87747 and previous config saved to /var/cache/conftool/dbconfig/20260119-113518-marostegui.json [11:35:56] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1206.eqiad.wmnet [11:36:14] 06SRE, 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06serviceops: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11533333 (10elukey) After the above maintenance I don't see any docker or s3cmd push problem, so all this was apparently due to the ceph's replication. 
[11:36:15] 07sre-alert-triage, 06Data-Platform-SRE (2026.01.05 - 2026.01.23), 07Essential-Work: Alert in need of triage: Dell PowerEdge or Supermicro Broadcom RAID Controller (instance an-worker1187) - https://phabricator.wikimedia.org/T405217#11533336 (10ops-monitoring-bot) Host an-worker1206.eqiad.wmnet rebooted by b... [11:37:20] (03PS1) 10Muehlenhoff: Fix the description of what restricted does after mwmaint* decom [puppet] - 10https://gerrit.wikimedia.org/r/1228457 (https://phabricator.wikimedia.org/T397017) [11:37:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1160 T414542', diff saved to https://phabricator.wikimedia.org/P87748 and previous config saved to /var/cache/conftool/dbconfig/20260119-113722-marostegui.json [11:38:44] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1160.eqiad.wmnet with reason: Maintenance [11:43:21] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-airflow-search: apply [11:43:28] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-airflow-search: apply [11:43:36] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1206.eqiad.wmnet [11:44:02] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1193.eqiad.wmnet [11:44:21] 07sre-alert-triage, 06Data-Platform-SRE (2026.01.05 - 2026.01.23), 07Essential-Work: Alert in need of triage: Dell PowerEdge or Supermicro Broadcom RAID Controller (instance an-worker1187) - https://phabricator.wikimedia.org/T405217#11533368 (10ops-monitoring-bot) Host an-worker1193.eqiad.wmnet rebooted by b... 
[11:46:28] !log installing openjpeg2 security updates [11:46:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:08] !log cgoubert@cumin1003 START - Cookbook sre.hosts.decommission for hosts wikikube-worker[2003-2004,2007-2010,2040,2043,2045,2048].codfw.wmnet [11:51:43] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1193.eqiad.wmnet [11:58:39] 10SRE-swift-storage, 10Ceph, 06ServiceOps new, 07Epic, and 2 others: Move the docker registry's /restricted prefix to Docker Distribution backed up by Ceph - https://phabricator.wikimedia.org/T412951#11533383 (10elukey) @dancy @Scott_French Hi! The apus testing is finally yielding some good results, so we... [11:59:59] !log ayounsi@cumin1003 START - Cookbook sre.hosts.reboot-single for host sretest2003.codfw.wmnet [12:01:23] 07sre-alert-triage, 06Data-Platform-SRE (2026.01.05 - 2026.01.23), 07Essential-Work: Alert in need of triage: Dell PowerEdge or Supermicro Broadcom RAID Controller (instance an-worker1187) - https://phabricator.wikimedia.org/T405217#11533389 (10BTullis) 05Open→03Resolved Zero warnings of this type, n... 
[12:03:38] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1228457 (https://phabricator.wikimedia.org/T397017) (owner: 10Muehlenhoff) [12:04:07] (03CR) 10Clément Goubert: [C:03+1] Remove ferm/rsync/tcpircbot settings [puppet] - 10https://gerrit.wikimedia.org/r/1228454 (https://phabricator.wikimedia.org/T397017) (owner: 10Muehlenhoff) [12:04:22] (03PS2) 10Muehlenhoff: Fix the description of what restricted does after mwmaint* decom [puppet] - 10https://gerrit.wikimedia.org/r/1228457 (https://phabricator.wikimedia.org/T397017) [12:07:39] (03PS3) 10Trueg: blazegraph: alert on ratio of failed queries increase [alerts] - 10https://gerrit.wikimedia.org/r/1227364 (https://phabricator.wikimedia.org/T414306) [12:08:03] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest2003.codfw.wmnet [12:08:07] (03CR) 10Muehlenhoff: [C:03+2] Fix the description of what restricted does after mwmaint* decom [puppet] - 10https://gerrit.wikimedia.org/r/1228457 (https://phabricator.wikimedia.org/T397017) (owner: 10Muehlenhoff) [12:12:04] !log cgoubert@cumin1003 START - Cookbook sre.dns.netbox [12:13:09] (03PS4) 10Ayounsi: sre.hosts.provision: (Dell) disable LLDP on Broadcom NICs [cookbooks] - 10https://gerrit.wikimedia.org/r/1207804 (https://phabricator.wikimedia.org/T250367) [12:16:05] !log cgoubert@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[2003-2004,2007-2010,2040,2043,2045,2048].codfw.wmnet decommissioned, removing all IPs except the asset tag one - cgoubert@cumin1003" [12:16:28] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[2003-2004,2007-2010,2040,2043,2045,2048].codfw.wmnet decommissioned, removing all IPs except the asset tag one - cgoubert@cumin1003" [12:16:28] !log cgoubert@cumin1003 
END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:16:29] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts wikikube-worker[2003-2004,2007-2010,2040,2043,2045,2048].codfw.wmnet [12:16:35] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, 06serviceops: decommission wikikube-worker[2003-2004,2007-2010,2019-2032,2040,2043,2045,2048].codfw.wmnet - https://phabricator.wikimedia.org/T409102#11533457 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by cgoubert@cumin1003 for... [12:16:50] !log ayounsi@cumin1003 START - Cookbook sre.hosts.provision for host sretest2003.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [12:18:07] !log cgoubert@cumin1003 START - Cookbook sre.hosts.decommission for hosts wikikube-worker[2019-2032].codfw.wmnet [12:21:20] 06SRE: please update astein puppet ssh key - https://phabricator.wikimedia.org/T414830#11533468 (10FCeratto-WMF) Related to T411679 where the access was initially granted. 
[12:21:36] 06SRE: please update astein puppet ssh key - https://phabricator.wikimedia.org/T414830#11533470 (10FCeratto-WMF) [12:21:39] 06SRE, 10SRE-Access-Requests, 06Fundraising-Backlog: Requesting access to analytics-privatedata-users for astein - https://phabricator.wikimedia.org/T411679#11533471 (10FCeratto-WMF) [12:21:51] cgoubert@cumin1003 decommission (PID 2402077) is awaiting input [12:22:40] ayounsi@cumin1003 provision (PID 2402029) is awaiting input [12:23:15] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#11533472 (10Ladsgroup) [12:23:17] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228469 [12:23:52] !log ayounsi@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2003.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [12:25:37] (03CR) 10Jgiannelos: mobileapps: Set limits on memory usage to avoid latency increase (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1227799 (https://phabricator.wikimedia.org/T410296) (owner: 10Jgiannelos) [12:27:23] !log ayounsi@cumin1003 START - Cookbook sre.hosts.provision for host sretest2003.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [12:30:46] !log kamila@deploy2002 helmfile [staging-eqiad] START helmfile.d/services/mw-debug: apply [12:30:49] !log kamila@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/services/mw-debug: apply [12:31:41] ayounsi@cumin1003 provision (PID 2404915) is awaiting input [12:33:50] 06SRE: please update astein puppet ssh key - https://phabricator.wikimedia.org/T414830#11533494 (10FCeratto-WMF) a:03FCeratto-WMF Pending OOB confirmation of the SSH key [12:34:18] (03CR) 10Milimetric: "I believe so, I'm just mirroring the behavior from the intake-analytics two rules above, but let me double check 
with @phuedx@wikimedia.or" [puppet] - 10https://gerrit.wikimedia.org/r/1218817 (https://phabricator.wikimedia.org/T412863) (owner: 10Milimetric) [12:34:49] !log kamila@deploy2002 helmfile [staging-eqiad] START helmfile.d/services/mw-debug: apply [12:36:35] !log ayounsi@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2003.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [12:37:53] !log kamila@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/services/mw-debug: apply [12:40:22] (03PS1) 10Federico Ceratto: admin: adding kareid to analytics-privatedata-users and deployment [puppet] - 10https://gerrit.wikimedia.org/r/1228478 (https://phabricator.wikimedia.org/T413364) [12:40:29] (03CR) 10CI reject: [V:04-1] admin: adding kareid to analytics-privatedata-users and deployment [puppet] - 10https://gerrit.wikimedia.org/r/1228478 (https://phabricator.wikimedia.org/T413364) (owner: 10Federico Ceratto) [12:40:33] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for kareid - https://phabricator.wikimedia.org/T413364#11533524 (10FCeratto-WMF) 05Open→03In progress a:03FCeratto-WMF [12:40:55] !log ayounsi@cumin1003 START - Cookbook sre.hosts.provision for host sretest2003.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [12:42:45] 06SRE, 10SRE-Access-Requests: Requesting access to L3 data access for kimpham (developer name Kim.pham) - https://phabricator.wikimedia.org/T414660#11533541 (10Milimetric) approved [12:43:05] 06SRE, 10SRE-Access-Requests: Requesting access to L3 data access for kimpham (developer name Kim.pham) - https://phabricator.wikimedia.org/T414660#11533543 (10FCeratto-WMF) [12:43:38] cgoubert@cumin1003 decommission (PID 2402077) is awaiting input [12:43:53] (03PS1) 10Btullis: Exclude old an-worker hosts from HDFS and YARN [puppet] - 
10https://gerrit.wikimedia.org/r/1228479 (https://phabricator.wikimedia.org/T414948) [12:44:14] PROBLEM - Host sretest2003 is DOWN: PING CRITICAL - Packet loss = 100% [12:44:24] (03PS1) 10Superpes15: [itwiki] Change the temporary logo for Vector legacy and fix tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1228480 (https://phabricator.wikimedia.org/T414320) [12:45:06] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7907/co" [puppet] - 10https://gerrit.wikimedia.org/r/1228479 (https://phabricator.wikimedia.org/T414948) (owner: 10Btullis) [12:48:47] (03PS1) 10Federico Ceratto: admin: add pham to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1228481 (https://phabricator.wikimedia.org/T414660) [12:49:25] (03CR) 10CI reject: [V:04-1] admin: add pham to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1228481 (https://phabricator.wikimedia.org/T414660) (owner: 10Federico Ceratto) [12:50:01] cgoubert@cumin1003 decommission (PID 2402077) is awaiting input [12:52:04] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2003.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [12:53:10] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2148.codfw.wmnet with reason: Maintenance [12:53:18] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2148 (T410589)', diff saved to https://phabricator.wikimedia.org/P87749 and previous config saved to /var/cache/conftool/dbconfig/20260119-125317-ladsgroup.json [12:53:22] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [12:57:42] (03PS5) 10Dreamy Jazz: Write new for CheckUser user agent table migration everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1223675 
(https://phabricator.wikimedia.org/T361196) [12:57:48] jouncebot: nowandnext [12:57:49] No deployments scheduled for the next 1 hour(s) and 2 minute(s) [12:57:49] In 1 hour(s) and 2 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260119T1400) [12:58:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1223675 (https://phabricator.wikimedia.org/T361196) (owner: 10Dreamy Jazz) [12:59:07] (03Merged) 10jenkins-bot: Write new for CheckUser user agent table migration everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1223675 (https://phabricator.wikimedia.org/T361196) (owner: 10Dreamy Jazz) [12:59:11] !log cgoubert@cumin1003 START - Cookbook sre.dns.netbox [13:00:49] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1223675|Write new for CheckUser user agent table migration everywhere (T361196)]] [13:00:53] T361196: Write to the cu_useragent table and agent_id columns on WMF wikis - https://phabricator.wikimedia.org/T361196 [13:01:46] !log ayounsi@cumin1003 START - Cookbook sre.hosts.provision for host sretest2003.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [13:02:42] !log cgoubert@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[2019-2032].codfw.wmnet decommissioned, removing all IPs except the asset tag one - cgoubert@cumin1003" [13:03:19] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[2019-2032].codfw.wmnet decommissioned, removing all IPs except the asset tag one - cgoubert@cumin1003" [13:03:20] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:03:20] !log cgoubert@cumin1003 END 
(PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts wikikube-worker[2019-2032].codfw.wmnet [13:03:30] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, 06serviceops: decommission wikikube-worker[2003-2004,2007-2010,2019-2032,2040,2043,2045,2048].codfw.wmnet - https://phabricator.wikimedia.org/T409102#11533586 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by cgoubert@cumin1003 for... [13:07:33] (03PS3) 10Clément Goubert: ratelimit-media: Initial service deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226814 (https://phabricator.wikimedia.org/T414439) [13:08:09] (03PS3) 10Clément Goubert: Add ratelimit-media namespace to wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226798 (https://phabricator.wikimedia.org/T414439) [13:10:56] Scap build and push is taking longer than usual but doesn't seem to have an error [13:11:08] Dreamy_Jazz: first monday deploy maybe? [13:11:15] Probably that [13:11:55] Yeah, nothing was deployed in the morning window so should be that [13:11:55] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2003.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [13:11:57] RECOVERY - Host sretest2003 is UP: PING OK - Packet loss = 0%, RTA = 30.32 ms [13:14:49] (03CR) 10Phuedx: trafficserver: Send /ins-502b/v2/events to intake-analytics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1218817 (https://phabricator.wikimedia.org/T412863) (owner: 10Milimetric) [13:18:39] (03CR) 10Muehlenhoff: [C:03+2] Remove ferm/rsync/tcpircbot settings [puppet] - 10https://gerrit.wikimedia.org/r/1228454 (https://phabricator.wikimedia.org/T397017) (owner: 10Muehlenhoff) [13:19:12] FIRING: [2x] CertAlmostExpired: Certificate for service opensearch-ipoid:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#opensearch-ipoid:30443 - TODO - 
https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:19:49] (03PS5) 10Ayounsi: sre.hosts.provision: (Dell) disable LLDP on main Broadcom NIC [cookbooks] - 10https://gerrit.wikimedia.org/r/1207804 (https://phabricator.wikimedia.org/T250367) [13:21:15] (03CR) 10Ayounsi: sre.hosts.provision: (Dell) disable LLDP on main Broadcom NIC (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1207804 (https://phabricator.wikimedia.org/T250367) (owner: 10Ayounsi) [13:23:40] !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:1223675|Write new for CheckUser user agent table migration everywhere (T361196)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:23:45] T361196: Write to the cu_useragent table and agent_id columns on WMF wikis - https://phabricator.wikimedia.org/T361196 [13:24:12] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:24:22] (03PS1) 10Dpogorzelski: ml-build: add missing folder dependency [puppet] - 10https://gerrit.wikimedia.org/r/1228489 [13:24:31] !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync [13:24:43] (03CR) 10Dpogorzelski: [C:03+2] ml-build: add missing folder dependency [puppet] - 10https://gerrit.wikimedia.org/r/1228489 (owner: 10Dpogorzelski) [13:24:51] (03CR) 10Dpogorzelski: [V:03+2 C:03+2] ml-build: add missing folder dependency [puppet] - 10https://gerrit.wikimedia.org/r/1228489 (owner: 10Dpogorzelski) [13:31:28] !log upgrade hcaptcha-proxy nodes to Bird 2.18 T413740 [13:31:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:32] T413740: Backport and test Bird 2.18 - https://phabricator.wikimedia.org/T413740 [13:33:54] (03CR) 
10Gmodena: [C:03+1] "LGTM. Feel free to merge when ready." [alerts] - 10https://gerrit.wikimedia.org/r/1227364 (https://phabricator.wikimedia.org/T414306) (owner: 10Trueg) [13:34:12] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:34:32] (03PS2) 10Superpes15: [itwiki] Change the temporary logo for Vector legacy and fix tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1228480 (https://phabricator.wikimedia.org/T414320) [13:35:17] (03CR) 10Trueg: [C:03+2] blazegraph: alert on ratio of failed queries increase [alerts] - 10https://gerrit.wikimedia.org/r/1227364 (https://phabricator.wikimedia.org/T414306) (owner: 10Trueg) [13:36:56] (03Merged) 10jenkins-bot: blazegraph: alert on ratio of failed queries increase [alerts] - 10https://gerrit.wikimedia.org/r/1227364 (https://phabricator.wikimedia.org/T414306) (owner: 10Trueg) [13:37:42] !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1223675|Write new for CheckUser user agent table migration everywhere (T361196)]] (duration: 36m 53s) [13:37:47] T361196: Write to the cu_useragent table and agent_id columns on WMF wikis - https://phabricator.wikimedia.org/T361196 [13:38:05] I'm done with deploying / using scap [13:41:37] 10SRE-swift-storage, 07Wikimedia-production-error: Timeouts towards ms-fe.svc.codfw.wmnet from jobrunners - https://phabricator.wikimedia.org/T413642#11533839 (10jijiki) 05Open→03Declined thank you @MatthewVernon ! I am closing this in favour of T414967, as I have noticed more timeouts. 
[13:42:24] jouncebot: nowandnext [13:42:24] No deployments scheduled for the next 0 hour(s) and 17 minute(s) [13:42:24] In 0 hour(s) and 17 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260119T1400) [13:42:42] (03CR) 10Zabe: [C:03+2] Start writing to il_target_id on large s6 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1227972 (https://phabricator.wikimedia.org/T413526) (owner: 10Zabe) [13:43:32] (03Merged) 10jenkins-bot: Start writing to il_target_id on large s6 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1227972 (https://phabricator.wikimedia.org/T413526) (owner: 10Zabe) [13:44:15] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1227972|Start writing to il_target_id on large s6 wikis (T413526)]] [13:44:19] T413526: Set imagelinks migration to write both - https://phabricator.wikimedia.org/T413526 [13:46:24] 07sre-alert-triage, 06Infrastructure-Foundations: Alert in need of triage: PuppetConstantChange (instance cloudidp2001-dev:9100) - https://phabricator.wikimedia.org/T414968 (10LSobanski) 03NEW [13:47:11] 07sre-alert-triage, 06Machine-Learning-Team: Alert in need of triage: SmartNotHealthy (instance ml-serve1001:9100) - https://phabricator.wikimedia.org/T414969 (10LSobanski) 03NEW [13:47:41] 07sre-alert-triage, 06Data-Platform-SRE: Alert in need of triage: KubernetesAPIErrorRate - https://phabricator.wikimedia.org/T414970 (10LSobanski) 03NEW [13:48:21] !log zabe@deploy2002 zabe: Backport for [[gerrit:1227972|Start writing to il_target_id on large s6 wikis (T413526)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:48:39] !log zabe@deploy2002 zabe: Continuing with sync [13:51:00] 07sre-alert-triage, 06Machine-Learning-Team: Alert in need of triage: HelmfileAdminNGPendingChanges (instance deploy1003:9100) - https://phabricator.wikimedia.org/T414971 (10LSobanski) 03NEW [13:52:45] (03PS3) 10Jgiannelos: mobileapps: Set limits on memory usage to avoid latency increase [deployment-charts] - 10https://gerrit.wikimedia.org/r/1227799 (https://phabricator.wikimedia.org/T410296) [13:52:52] !log Running populateUserAgentTable.php on group0 wikis for T413868 [13:52:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:56] T413868: Populate the cu_useragent table and agent_id columns on WMF wikis - https://phabricator.wikimedia.org/T413868 [13:54:46] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1227972|Start writing to il_target_id on large s6 wikis (T413526)]] (duration: 10m 31s) [13:54:50] T413526: Set imagelinks migration to write both - https://phabricator.wikimedia.org/T413526 [13:54:59] (03CR) 10Vgutierrez: trafficserver: Send /ins-502b/v2/events to intake-analytics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1218817 (https://phabricator.wikimedia.org/T412863) (owner: 10Milimetric) [13:55:18] !log clean up nft prom file on tcp-proxy instances [13:55:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:24] (03PS1) 10Joal: Remove test druid cluster noisy jvm GC params [puppet] - 10https://gerrit.wikimedia.org/r/1228498 (https://phabricator.wikimedia.org/T278056) [13:56:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2240 (T413525)', diff saved to https://phabricator.wikimedia.org/P87750 and previous config saved to /var/cache/conftool/dbconfig/20260119-135602-marostegui.json [13:56:07] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [13:56:48] RESOLVED: PuppetFailure: Puppet has failed 
on ml-build1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [13:57:15] (03PS4) 10Milimetric: trafficserver: Send /ins-502b/v2/events to intake-analytics [puppet] - 10https://gerrit.wikimedia.org/r/1218817 (https://phabricator.wikimedia.org/T412863) [13:58:05] 06SRE, 10LDAP-Access-Requests, 10Phabricator: undisable vanderwaalforces in phabricator and ldap - https://phabricator.wikimedia.org/T414774#11533967 (10FCeratto-WMF) [13:59:11] (03PS5) 10Milimetric: trafficserver: Send /ins-502b/v2/events to intake-analytics [puppet] - 10https://gerrit.wikimedia.org/r/1218817 (https://phabricator.wikimedia.org/T412863) [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260119T1400). [14:00:05] Sergi0 and Superpes: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:22] 👀 [14:00:41] o/ [14:01:02] sergi0: given yours a no-op, let's just do both at once? [14:01:06] assuming Superpes is around [14:01:32] Hi urbanecm :) [14:01:44] urbanecm: sure [14:01:46] Superpes: hey! long time no talk :) [14:01:50] before we start though... [14:01:57] Yep absolutely (my fault) :( [14:02:10] that's not what i was implying! :D [14:02:35] But it's actually true :D [14:02:35] (03Abandoned) 10Brouberol: Move mpic service mesh entry to test-kitchen [puppet] - 10https://gerrit.wikimedia.org/r/1212435 (https://phabricator.wikimedia.org/T407805) (owner: 10Brouberol) [14:02:37] (03CR) 10Gkyziridis: [C:03+1] "LGTM!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1228426 (https://phabricator.wikimedia.org/T414060) (owner: 10Kevin Bazira) [14:02:38] (03Abandoned) 10Brouberol: test-kitchen: rename the OIDC services [puppet] - 10https://gerrit.wikimedia.org/r/1212433 (https://phabricator.wikimedia.org/T407805) (owner: 10Brouberol) [14:02:38] Superpes: what is the expected speed of the deployment? [14:02:41] (03Abandoned) 10Brouberol: test-kitchen: drop mpic.w.o from OIDC service [puppet] - 10https://gerrit.wikimedia.org/r/1212432 (https://phabricator.wikimedia.org/T407805) (owner: 10Brouberol) [14:02:43] o/ [14:02:44] (03Abandoned) 10Brouberol: test-kitchen-next: drop mpic-next.w.o from OIDC service [puppet] - 10https://gerrit.wikimedia.org/r/1212438 (https://phabricator.wikimedia.org/T407805) (owner: 10Brouberol) [14:02:47] (03Abandoned) 10Brouberol: testkitchen: rename the OIDC services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212426 (https://phabricator.wikimedia.org/T407805) (owner: 10Brouberol) [14:02:51] (03Abandoned) 10Brouberol: testkitchen: drop the mpic.w.o domains from the ingress gateways [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212425 (https://phabricator.wikimedia.org/T407805) (owner: 10Brouberol) [14:02:53] (03Abandoned) 10Brouberol: test-kitchen: drop the mpic.w.o SANs from the certificate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212424 (https://phabricator.wikimedia.org/T407805) (owner: 10Brouberol) [14:02:56] (03Abandoned) 10Brouberol: test-kitchen: set the OIDC callback URL domain to test-kitchen.w.o [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212421 (https://phabricator.wikimedia.org/T407805) (owner: 10Brouberol) [14:02:56] URLs under /static are client-cached, so this means this will only be visible once the client cache clears too [14:02:59] (03Abandoned) 10Brouberol: Rename mpic service to test-kitchen [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212423 
(https://phabricator.wikimedia.org/T407805) (owner: 10Brouberol) [14:03:02] (03Abandoned) 10Brouberol: Rename mpic-next service to test-kitchen-next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212422 (https://phabricator.wikimedia.org/T407805) (owner: 10Brouberol) [14:03:06] (03Abandoned) 10Brouberol: testkitchen-next: drop mpic-next.w.o from OIDC service [puppet] - 10https://gerrit.wikimedia.org/r/1212431 (https://phabricator.wikimedia.org/T407805) (owner: 10Brouberol) [14:03:07] (manually via Ctrl+Shift+R, or "in some time" otherwise) [14:03:12] Superpes: thoughts on that? [14:03:18] or Lucas_WMDE since you're waving :) [14:03:45] is purgeList not enough to fix that caching? [14:03:49] urbanecm  yep I know, no rush for me, after all, everything should be resolved within a week, right? [14:04:02] anyway I would still deploy it ^^ [14:04:10] (but I think you know more about this area than I do ^^) [14:04:39] Lucas_WMDE: that purges it from the CDN cache, but not from the browser cache on the client side [14:04:55] Superpes: more or less. if it's not time-sensitive, then it's fine. [14:05:07] 06SRE, 10observability, 10Prod-Kubernetes, 06ServiceOps new: write some recording rules for queries used in the appserver RED k8s dashboard - https://phabricator.wikimedia.org/T249663#11534004 (10jijiki) @CDanis for the time being I made some really minor changes (eg prevent choosing both DCs, removed auto... 
[14:05:42] (03CR) 10Urbanecm: [C:03+2] [itwiki] Change the temporary logo for Vector legacy and fix tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1228480 (https://phabricator.wikimedia.org/T414320) (owner: 10Superpes15) [14:05:44] (03CR) 10Urbanecm: [C:03+2] GrowthExperiments: cleanup unnecessary GEUseMetricsPlatformExtension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1219541 (https://phabricator.wikimedia.org/T411479) (owner: 10Sergio Gimeno) [14:05:48] anyway, let's go ahead then [14:06:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2240', diff saved to https://phabricator.wikimedia.org/P87751 and previous config saved to /var/cache/conftool/dbconfig/20260119-140611-marostegui.json [14:06:15] No issue for me :) [14:06:33] (03Merged) 10jenkins-bot: [itwiki] Change the temporary logo for Vector legacy and fix tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1228480 (https://phabricator.wikimedia.org/T414320) (owner: 10Superpes15) [14:06:37] (03Merged) 10jenkins-bot: GrowthExperiments: cleanup unnecessary GEUseMetricsPlatformExtension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1219541 (https://phabricator.wikimedia.org/T411479) (owner: 10Sergio Gimeno) [14:06:44] (03CR) 10TrainBranchBot: [C:03+2] "Copied votes on follow-up patch sets have been updated:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1228480 (https://phabricator.wikimedia.org/T414320) (owner: 10Superpes15) [14:06:44] (03CR) 10TrainBranchBot: [C:03+2] "Copied votes on follow-up patch sets have been updated:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1219541 (https://phabricator.wikimedia.org/T411479) (owner: 10Sergio Gimeno) [14:06:49] sounds like the worst case is that some people won’t see the bday logo, which sounds okay to me [14:06:58] (or that they see the old logo with a stretched aspect ratio? 
not sure) [14:07:00] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1228480|[itwiki] Change the temporary logo for Vector legacy and fix tagline (T414320)]], [[gerrit:1219541|GrowthExperiments: cleanup unnecessary GEUseMetricsPlatformExtension (T411479)]] [14:07:06] T414320: Requesting temporary logo change for it.wikipedia.org (WP25) - https://phabricator.wikimedia.org/T414320 [14:07:06] T411479: Turn on wgGEUseMetricsPlatformExtension by default on all Wikimedia wikis - https://phabricator.wikimedia.org/T411479 [14:07:32] they'd see what is on there now [14:07:58] (and there's a chance of dimensions mismatch, yes) [14:08:20] (03PS1) 10Brouberol: Rename mpic_next IDP services to test_kitchen_next [puppet] - 10https://gerrit.wikimedia.org/r/1228502 (https://phabricator.wikimedia.org/T407805) [14:08:29] urbanecm Also, can you confirm that, when removing a temporary logo, it's best first of all to only change it, and then after a week, the temporary logo files can be removed? I've always done it this way, but I see that not everyone does it :/ [14:08:44] (03CR) 10Kevin Bazira: [C:03+2] ml-services: rr-wikidata horizontal scaling [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228426 (https://phabricator.wikimedia.org/T414060) (owner: 10Kevin Bazira) [14:08:55] (03PS1) 10Brouberol: mpic-next: rename the oidc client_id to test_kitchen [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228505 (https://phabricator.wikimedia.org/T407805) [14:08:57] !log urbanecm@deploy2002 urbanecm, sgimeno, superpes: Backport for [[gerrit:1228480|[itwiki] Change the temporary logo for Vector legacy and fix tagline (T414320)]], [[gerrit:1219541|GrowthExperiments: cleanup unnecessary GEUseMetricsPlatformExtension (T411479)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
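The point made above, that a purge clears the CDN copy but not the copy already held in a browser, can be sketched with a toy two-layer cache. This is only an illustration under stated assumptions: the `/static/logo.svg` path, the TTL values and the `Cache`/`fetch` helpers are hypothetical and do not describe the real Wikimedia cache configuration.

```python
from dataclasses import dataclass, field


@dataclass
class Cache:
    """A tiny TTL cache keyed by URL (illustrative only)."""
    ttl: int
    store: dict = field(default_factory=dict)  # url -> (body, expires_at)

    def get(self, url, now):
        entry = self.store.get(url)
        if entry is not None and now < entry[1]:
            return entry[0]
        return None  # missing or expired

    def put(self, url, body, now):
        self.store[url] = (body, now + self.ttl)

    def purge(self, url):
        self.store.pop(url, None)


def fetch(url, now, browser, cdn, origin):
    """Resolve a URL through the browser cache, then the CDN, then the origin."""
    body = browser.get(url, now)
    if body is not None:
        return body  # the browser never asks the CDN while its copy is fresh
    body = cdn.get(url, now)
    if body is None:
        body = origin[url]
        cdn.put(url, body, now)
    browser.put(url, body, now)
    return body


def demo():
    url = "/static/logo.svg"      # hypothetical path
    origin = {url: "old logo"}
    browser = Cache(ttl=3600)     # assumed client-side lifetime
    cdn = Cache(ttl=86400)        # assumed CDN lifetime
    seen = [fetch(url, 0, browser, cdn, origin)]
    origin[url] = "new logo"      # the deployment changes the file
    cdn.purge(url)                # the purge only reaches the CDN layer
    seen.append(fetch(url, 10, browser, cdn, origin))    # browser copy still fresh
    seen.append(fetch(url, 4000, browser, cdn, origin))  # browser copy has expired
    return seen


print(demo())
```

In this model the second request still returns the old logo because the browser answers from its own copy. Only a forced reload (Ctrl+Shift+R) or the expiry of the client-side lifetime makes the new file visible.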
[14:09:03] (03CR) 10Muehlenhoff: "There're already a patch at https://gerrit.wikimedia.org/r/c/operations/puppet/+/1226854" [puppet] - 10https://gerrit.wikimedia.org/r/1228478 (https://phabricator.wikimedia.org/T413364) (owner: 10Federico Ceratto) [14:09:08] Testing [14:09:11] Superpes: it is indeed best, but as long as the deployer does NOT purge the cache (to force a 404 on the URL), it should work either way [14:09:23] thanks, also cc sergi0 if there's something to test [14:09:52] Looks good to me :) [14:10:01] 06SRE, 10SRE-Access-Requests: Requesting access to SRE/production access for Kim.pham (kimpham in phab) - https://phabricator.wikimedia.org/T414671#11534031 (10FCeratto-WMF) If I'm understanding correctly that this is a request for the `deployment` group, @thcipriani can you please approve it? [14:10:14] (03PS1) 10Zabe: Start writing to il_target_id everywhere except commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1228506 (https://phabricator.wikimedia.org/T413526) [14:10:21] perfect [14:10:26] !log urbanecm@deploy2002 urbanecm, sgimeno, superpes: Continuing with sync [14:10:30] syncing [14:10:38] (03Merged) 10jenkins-bot: ml-services: rr-wikidata horizontal scaling [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228426 (https://phabricator.wikimedia.org/T414060) (owner: 10Kevin Bazira) [14:10:51] urbanecm Gotcha! 
Thanks :3 [14:14:18] (03PS1) 10Marostegui: production-m5.sql.erb: Remove old mwmaint IP [puppet] - 10https://gerrit.wikimedia.org/r/1228507 [14:14:28] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1228480|[itwiki] Change the temporary logo for Vector legacy and fix tagline (T414320)]], [[gerrit:1219541|GrowthExperiments: cleanup unnecessary GEUseMetricsPlatformExtension (T411479)]] (duration: 07m 28s) [14:14:34] T414320: Requesting temporary logo change for it.wikipedia.org (WP25) - https://phabricator.wikimedia.org/T414320 [14:14:34] T411479: Turn on wgGEUseMetricsPlatformExtension by default on all Wikimedia wikis - https://phabricator.wikimedia.org/T411479 [14:14:39] and done [14:14:41] anything else, anyone? [14:15:05] urbanecm Many thanks for your assistance :3 Not from my side :) [14:15:08] (03CR) 10Marostegui: "This is a noop - the grants will have to be checked and removed from production databases." [puppet] - 10https://gerrit.wikimedia.org/r/1228507 (owner: 10Marostegui) [14:15:11] happy to help [14:16:02] (03CR) 10Muehlenhoff: [C:03+1] "Looks good, but maybe also link to https://phabricator.wikimedia.org/T397017" [puppet] - 10https://gerrit.wikimedia.org/r/1228507 (owner: 10Marostegui) [14:16:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2240', diff saved to https://phabricator.wikimedia.org/P87752 and previous config saved to /var/cache/conftool/dbconfig/20260119-141619-marostegui.json [14:16:37] (03PS2) 10Marostegui: production-m5.sql.erb: Remove old mwmaint IP [puppet] - 10https://gerrit.wikimedia.org/r/1228507 (https://phabricator.wikimedia.org/T397017) [14:17:18] !log kevinbazira@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
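The "Finished scap sync-world ... (duration: ...)" entries can be cross-checked against the matching "Started" entries by subtracting the duration from the finish timestamp. The sketch below assumes the duration always has the "NNm NNs" shape seen in this log; other shapes (for example with hours) are not handled, and the helper names are made up for illustration.

```python
import re
from datetime import datetime, timedelta

# Matches the "(duration: 07m 28s)" suffix as it appears in this log.
DURATION_RE = re.compile(r"\(duration: (\d+)m (\d+)s\)")


def parse_duration(entry):
    """Return the duration in a 'Finished scap ...' entry as a timedelta."""
    match = DURATION_RE.search(entry)
    if match is None:
        raise ValueError("no '(duration: NNm NNs)' suffix found")
    minutes, seconds = (int(group) for group in match.groups())
    return timedelta(minutes=minutes, seconds=seconds)


def implied_start(finish_hms, entry):
    """Subtract the logged duration from the finish time (HH:MM:SS, same day)."""
    finished = datetime.strptime(finish_hms, "%H:%M:%S")
    return (finished - parse_duration(entry)).strftime("%H:%M:%S")


# Entries abbreviated from the log above: finish timestamp plus message tail.
print(implied_start("13:54:46", "Finished scap sync-world: ... (duration: 10m 31s)"))
print(implied_start("14:14:28", "Finished scap sync-world: ... (duration: 07m 28s)"))
```

For these two backports the implied start times come out as 13:44:15 and 14:07:00, which are the timestamps of the corresponding "Started scap sync-world" entries.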
[14:17:52] (03CR) 10Btullis: [C:03+2] Remove test druid cluster noisy jvm GC params [puppet] - 10https://gerrit.wikimedia.org/r/1228498 (https://phabricator.wikimedia.org/T278056) (owner: 10Joal) [14:19:33] (03CR) 10Brouberol: [C:03+1] Remove test druid cluster noisy jvm GC params [puppet] - 10https://gerrit.wikimedia.org/r/1228498 (https://phabricator.wikimedia.org/T278056) (owner: 10Joal) [14:20:33] (03CR) 10Marostegui: "Grants are still in the DB, just confirmed that, so I will delete them from there too." [puppet] - 10https://gerrit.wikimedia.org/r/1228507 (https://phabricator.wikimedia.org/T397017) (owner: 10Marostegui) [14:20:50] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [14:22:17] 10SRE-swift-storage: Cleanup old swift-cert - https://phabricator.wikimedia.org/T414973 (10MoritzMuehlenhoff) 03NEW [14:22:29] (03CR) 10Santiago Faci: [C:03+1] Rename mpic_next IDP services to test_kitchen_next [puppet] - 10https://gerrit.wikimedia.org/r/1228502 (https://phabricator.wikimedia.org/T407805) (owner: 10Brouberol) [14:23:48] (03PS2) 10Brouberol: mpic-next: rename the oidc client_id to test_kitchen_next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228505 (https://phabricator.wikimedia.org/T407805) [14:24:48] * zabe tests something on mw-experimental [14:25:09] (03CR) 10Brouberol: [C:03+2] Rename mpic_next IDP services to test_kitchen_next [puppet] - 10https://gerrit.wikimedia.org/r/1228502 (https://phabricator.wikimedia.org/T407805) (owner: 10Brouberol) [14:25:36] !log zabe@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply [14:26:18] !log zabe@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply [14:26:21] (03CR) 10Santiago Faci: [C:03+1] mpic-next: rename the oidc client_id to test_kitchen_next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228505 (https://phabricator.wikimedia.org/T407805) (owner: 10Brouberol) [14:26:29] 
!log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2240 (T413525)', diff saved to https://phabricator.wikimedia.org/P87753 and previous config saved to /var/cache/conftool/dbconfig/20260119-142627-marostegui.json [14:26:33] T413525: Add il_target_id to imagelinks table in wmf production - https://phabricator.wikimedia.org/T413525 [14:26:40] (03CR) 10Brouberol: [C:03+2] mpic-next: rename the oidc client_id to test_kitchen_next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228505 (https://phabricator.wikimedia.org/T407805) (owner: 10Brouberol) [14:30:02] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Network is hard down on an-worker1160.eqiad.wmnet - https://phabricator.wikimedia.org/T414942#11534099 (10Papaul) a:03VRiley-WMF [14:30:10] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [14:30:38] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [14:31:47] (03PS1) 10Brouberol: mpic: rename the oidc client_id to test_kitchen [puppet] - 10https://gerrit.wikimedia.org/r/1228513 (https://phabricator.wikimedia.org/T407805) [14:31:49] (03PS1) 10Brouberol: test-kitchen: rewrite mpic.w.o to test-kitchen.w.o [puppet] - 10https://gerrit.wikimedia.org/r/1228514 (https://phabricator.wikimedia.org/T407805) [14:32:06] (03CR) 10Elukey: [C:03+1] "I left a note about a possible improvement to save network calls and to avoid recomputing the same data structure two times in a row. 
Not " [cookbooks] - 10https://gerrit.wikimedia.org/r/1207804 (https://phabricator.wikimedia.org/T250367) (owner: 10Ayounsi) [14:32:31] (03PS1) 10Brouberol: mpic: rename the oidc client_id to test_kitchen [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228516 (https://phabricator.wikimedia.org/T407805) [14:33:39] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11534114 (10ssingh) Hi @RobH: Thanks for following up on this. Any update from the `eqsin` folks? [14:34:34] I have another patch for the window [14:34:56] (03PS1) 10Ssingh: plugins/wmf-netbox: remove ipv4 only for DNS hosts BGP [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1228518 (https://phabricator.wikimedia.org/T81605) [14:35:27] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 19 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1223635 (https://phabricator.wikimedia.org/T410615) (owner: 10Kosta Harlan) [14:35:41] (03CR) 10Ssingh: [C:04-2] "Thanks for the reminder -- I actually forgot about that. Patched in I07ead55fecf4d8e645115e873ebbfb0ddc7ef39f." 
[puppet] - 10https://gerrit.wikimedia.org/r/1226928 (https://phabricator.wikimedia.org/T81605) (owner: 10Ssingh) [14:36:05] kostajh: ping urbanecm, but I assume you can go ahead [14:36:15] seems like they're done [14:36:25] yeah [14:36:33] I wonder if deployments interfere with mw-experimental… not sure [14:37:05] “Scap skips restarting of mw-experimental pods during deployments.” ok good :) https://wikitech.wikimedia.org/wiki/Mw-experimental [14:37:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1223635 (https://phabricator.wikimedia.org/T410615) (owner: 10Kosta Harlan) [14:38:35] (03Merged) 10jenkins-bot: IPReputation: Define data provider, URL and developer mode config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1223635 (https://phabricator.wikimedia.org/T410615) (owner: 10Kosta Harlan) [14:38:53] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1223635|IPReputation: Define data provider, URL and developer mode config (T410615)]] [14:38:56] (03PS2) 10Arnaudb: gerrit: change healthcheck URL for service catalog [puppet] - 10https://gerrit.wikimedia.org/r/1228515 (https://phabricator.wikimedia.org/T408532) [14:38:58] T410615: Update Extension:IPReputation to support OpenSearch - https://phabricator.wikimedia.org/T410615 [14:39:42] (03PS1) 10Slyngshede: Release version 0.1.14 [software/bitu] - 10https://gerrit.wikimedia.org/r/1228520 [14:40:24] (03CR) 10Vgutierrez: [C:03+1] gerrit: change healthcheck URL for service catalog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1228515 (https://phabricator.wikimedia.org/T408532) (owner: 10Arnaudb) [14:40:47] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1223635|IPReputation: Define data provider, URL and developer mode config (T410615)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:41:52] (03CR) 10Santiago Faci: [C:03+1] mpic: rename the oidc client_id to test_kitchen [puppet] - 10https://gerrit.wikimedia.org/r/1228513 (https://phabricator.wikimedia.org/T407805) (owner: 10Brouberol) [14:42:07] (03CR) 10Santiago Faci: [C:03+1] test-kitchen: rewrite mpic.w.o to test-kitchen.w.o [puppet] - 10https://gerrit.wikimedia.org/r/1228514 (https://phabricator.wikimedia.org/T407805) (owner: 10Brouberol) [14:42:12] (03PS2) 10Slyngshede: Release version 0.1.14 [software/bitu] - 10https://gerrit.wikimedia.org/r/1228520 [14:42:32] (03CR) 10Santiago Faci: [C:03+1] mpic: rename the oidc client_id to test_kitchen [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228516 (https://phabricator.wikimedia.org/T407805) (owner: 10Brouberol) [14:42:45] (03CR) 10Brouberol: [C:03+2] mpic: rename the oidc client_id to test_kitchen [puppet] - 10https://gerrit.wikimedia.org/r/1228513 (https://phabricator.wikimedia.org/T407805) (owner: 10Brouberol) [14:42:48] (03CR) 10Brouberol: [C:03+2] test-kitchen: rewrite mpic.w.o to test-kitchen.w.o [puppet] - 10https://gerrit.wikimedia.org/r/1228514 (https://phabricator.wikimedia.org/T407805) (owner: 10Brouberol) [14:42:55] (03CR) 10Arnaudb: [C:03+2] gerrit: change healthcheck URL for service catalog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1228515 (https://phabricator.wikimedia.org/T408532) (owner: 10Arnaudb) [14:42:58] (03CR) 10Brouberol: [C:03+2] mpic: rename the oidc client_id to test_kitchen [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228516 (https://phabricator.wikimedia.org/T407805) (owner: 10Brouberol) [14:44:51] testing my patch [14:47:01] !log kharlan@deploy2002 kharlan: Continuing with sync [14:47:27] (03CR) 10Cathal Mooney: [C:03+1] dnsbox: codfw: advertise ns1 IPv6 (2620:0:860:53::/128) [puppet] - 10https://gerrit.wikimedia.org/r/1226928 (https://phabricator.wikimedia.org/T81605) (owner: 10Ssingh) [14:48:07] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START 
helmfile.d/dse-k8s-services/mpic: apply [14:48:17] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply [14:48:30] FIRING: LibericaStaleConfig: Liberica instance lvs3010 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig - https://grafana.wikimedia.org/d/fa4de97a-7114-48c7-a91a-f56089ef554f/liberica?orgId=1&viewPanel=10&var-site=esams&var-instance=lvs3010 - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig [14:48:48] hmm that's you arnaudb ^^ [14:48:50] vgutierrez: probably the healthcheck change? [14:48:59] (03CR) 10Muehlenhoff: "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1228507 (https://phabricator.wikimedia.org/T397017) (owner: 10Marostegui) [14:49:16] sukhe: yep [14:49:18] vgutierrez: I guess, should I revert? [14:49:27] arnaudb: nope, you should apply the new config :) [14:49:30] 06SRE, 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06serviceops: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11534166 (10elukey) Next steps: 1) Clean up the bucket from all the tests via s3cmd (only from a DC) and check replication. 2) Try to push and pull an ima... [14:49:39] oh oops, I missed that step, hold on [14:49:57] that's gonna require a pybal restart on codfw and eqiad BTW [14:50:07] liberica is just a config reload [14:50:08] could I ask you to pair up for this? 
[14:50:09] arnaudb: sudo cookbook sre.loadbalancer.admin --query 'P{lvs3010*}' --reason "BGP config reload" config_reload [14:50:19] and then see if all is good and repeat for others [14:50:23] ack, on it [14:50:26] sukhe is the fastest manager around :D [14:50:48] !log arnaudb@cumin1003 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs3010*} and A:liberica [14:51:02] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1223635|IPReputation: Define data provider, URL and developer mode config (T410615)]] (duration: 12m 09s) [14:51:06] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs3010*} and A:liberica [14:51:07] T410615: Update Extension:IPReputation to support OpenSearch - https://phabricator.wikimedia.org/T410615 [14:51:18] `All config_reload were successful` [14:51:44] arnaudb: cool... but you need to hit all secondary and high-traffic2 instances :) [14:51:54] or if you're feeling lazy just hit A:liberica [14:52:34] so: `sudo cookbook sre.loadbalancer.admin --query 'A:liberica' --reason "BGP config reload" config_reload` [14:52:41] yeah once you have tested it, it's fine to do it [14:52:49] yeah.. 
probably I'd adjust the --reason message :D [14:53:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.69% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:53:26] !log arnaudb@cumin1003 START - Cookbook sre.loadbalancer.admin config_reloading A:liberica and A:liberica [14:53:30] FIRING: [4x] LibericaStaleConfig: Liberica instance lvs3008 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig [14:53:52] arnaudb: did you check on lvs3010 that the new healthcheck is working as expected? :) [14:54:12] it mentioned a successful restart, vgutierrez; I did not check any further [14:54:16] should I ^C the next run? [14:54:21] nah.. [14:54:22] I did it for you [14:54:25] https://www.irccloud.com/pastebin/RISXW7xR/ [14:54:58] can also be checked on grafana: https://grafana.wikimedia.org/goto/ESWDMhIvg?orgId=1 [14:55:49] neat, noted, thanks!
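The rollout order suggested above (reload a single instance, confirm the healthcheck, then reload every Liberica host) can be written down as a dry-run command builder. This is a sketch only: it prints the commands and never runs them, it uses just the `--query`, `--reason` and `config_reload` arguments quoted in the chat, and the function names and the example reason text are hypothetical.

```python
import shlex


def reload_command(query, reason):
    """Build the argv for one config reload, mirroring the command quoted in the chat."""
    return [
        "sudo", "cookbook", "sre.loadbalancer.admin",
        "--query", query,
        "--reason", reason,
        "config_reload",
    ]


def staged_rollout(canary_query, fleet_query, reason):
    """Return shell-quoted commands: the single canary first, then the whole alias."""
    return [
        shlex.join(reload_command(query, reason))
        for query in (canary_query, fleet_query)
    ]


# Dry run: print the two steps; check the canary's healthcheck before pasting the second.
for step in staged_rollout("P{lvs3010*}", "A:liberica", "healthcheck URL change"):
    print(step)
```

Keeping the reason as a parameter reflects the remark that the "BGP config reload" wording from the pasted example should be adjusted to describe the actual change.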
[14:56:05] (03PS1) 10Brouberol: Definition of the test-kitchen chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228521 (https://phabricator.wikimedia.org/T407808) [14:56:07] (03PS1) 10Brouberol: Rename any reference of mpic into test-kitchen [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228522 (https://phabricator.wikimedia.org/T407808) [14:57:43] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1228520 (owner: 10Slyngshede) [14:57:48] (03CR) 10Kosta Harlan: (WIP) IPReputation: Enable OpenSearch IPoid provider on testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1223636 (https://phabricator.wikimedia.org/T410615) (owner: 10Kosta Harlan) [14:57:54] (03PS7) 10Kosta Harlan: (WIP) IPReputation: Enable OpenSearch IPoid provider on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1223636 (https://phabricator.wikimedia.org/T410615) [14:57:58] (03PS8) 10Kosta Harlan: IPReputation: Enable OpenSearch IPoid provider on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1223636 (https://phabricator.wikimedia.org/T410615) [14:58:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.69% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:58:28] (03Abandoned) 10Federico Ceratto: admin: adding kareid to analytics-privatedata-users and deployment [puppet] - 10https://gerrit.wikimedia.org/r/1228478 (https://phabricator.wikimedia.org/T413364) (owner: 10Federico Ceratto) [15:01:08] (03CR) 10Dreamy Jazz: [C:03+1] "LGTM once wmf.12 is deployed to testwikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1223636 (https://phabricator.wikimedia.org/T410615) (owner: 10Kosta 
Harlan) [15:02:20] arnaudb: do you need help restarting pybal on codfw & eqiad? [15:02:39] I'm finishing the lvs run, I'll need help yes please! [15:02:44] arnaudb: you've updated the config on the POPs but core datacenters are still running pybal [15:03:30] RESOLVED: [4x] LibericaStaleConfig: Liberica instance lvs3008 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig [15:04:59] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading A:liberica and A:liberica [15:05:46] (03PS3) 10Federico Ceratto: admin/data: Shell, deployers, analytics-privatedata-users for kareid [puppet] - 10https://gerrit.wikimedia.org/r/1226854 (https://phabricator.wikimedia.org/T413364) (owner: 10JMeybohm) [15:06:39] (03CR) 10Federico Ceratto: "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1226854 (https://phabricator.wikimedia.org/T413364) (owner: 10JMeybohm) [15:08:25] FIRING: [2x] SystemdUnitFailed: docker.service on ml-build1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:09:12] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:15] !log arnaudb@cumin1003 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-secondary-codfw and A:lvs (T414940) [15:09:18] T414940: Handle httpd log surplus coming from Liberica - https://phabricator.wikimedia.org/T414940 [15:09:43] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-secondary-codfw and A:lvs 
(T414940) [15:10:25] (03PS1) 10Kevin Bazira: ml-services: rr-wikidata update replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228527 (https://phabricator.wikimedia.org/T414060) [15:12:39] !log arnaudb@cumin1003 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-secondary-eqiad and A:lvs (T414940) [15:13:21] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-secondary-eqiad and A:lvs (T414940) [15:13:25] RESOLVED: [2x] SystemdUnitFailed: docker.service on ml-build1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:14:57] !log arnaudb@cumin1003 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-high-traffic1-codfw and A:lvs (T414940) [15:15:01] T414940: Handle httpd log surplus coming from Liberica - https://phabricator.wikimedia.org/T414940 [15:15:27] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-high-traffic1-codfw and A:lvs (T414940) [15:16:22] !log arnaudb@cumin1003 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-high-traffic1-eqiad and A:lvs (T414940) [15:16:39] (03CR) 10Muehlenhoff: [C:03+1] "Looks good and approvals are in place now" [puppet] - 10https://gerrit.wikimedia.org/r/1226854 (https://phabricator.wikimedia.org/T413364) (owner: 10JMeybohm) [15:16:52] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-high-traffic1-eqiad and A:lvs (T414940) [15:19:12] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [15:23:00] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87754 and previous config saved to /var/cache/conftool/dbconfig/20260119-152300-marostegui.json [15:23:06] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [15:23:06] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [15:23:59] !log Running populateUserAgentTable.php on group1 wikis for T413868 [15:24:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:03] T413868: Populate the cu_useragent table and agent_id columns on WMF wikis - https://phabricator.wikimedia.org/T413868 [15:26:14] (03CR) 10JMeybohm: [C:03+1] admin/data: Shell, deployers, analytics-privatedata-users for kareid [puppet] - 10https://gerrit.wikimedia.org/r/1226854 (https://phabricator.wikimedia.org/T413364) (owner: 10JMeybohm) [15:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260119T1530) [15:30:37] (03PS1) 10Brouberol: Define the test-kitchen-next namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228528 (https://phabricator.wikimedia.org/T407808) [15:30:39] (03PS1) 10Brouberol: Define the test-kitchen-next service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228529 (https://phabricator.wikimedia.org/T407808) [15:30:41] (03PS1) 10Brouberol: Define the test-kitchen namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228530 (https://phabricator.wikimedia.org/T407808) [15:30:43] (03PS1) 10Brouberol: Define the test-kitchen service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228531 (https://phabricator.wikimedia.org/T407808) [15:32:26] (03CR) 10Santiago Faci: [C:03+1] Definition of the 
test-kitchen chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228521 (https://phabricator.wikimedia.org/T407808) (owner: 10Brouberol) [15:33:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P87755 and previous config saved to /var/cache/conftool/dbconfig/20260119-153308-marostegui.json [15:34:12] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:34:48] (03CR) 10Gkyziridis: [C:03+1] "LGTM Thnx" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228527 (https://phabricator.wikimedia.org/T414060) (owner: 10Kevin Bazira) [15:35:00] (03CR) 10Kevin Bazira: [C:03+2] ml-services: rr-wikidata update replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228527 (https://phabricator.wikimedia.org/T414060) (owner: 10Kevin Bazira) [15:36:54] (03Merged) 10jenkins-bot: ml-services: rr-wikidata update replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228527 (https://phabricator.wikimedia.org/T414060) (owner: 10Kevin Bazira) [15:37:58] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [15:38:19] !log kevinbazira@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [15:43:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P87756 and previous config saved to /var/cache/conftool/dbconfig/20260119-154316-marostegui.json [15:44:33] RECOVERY - Postfix SMTP on crm2001 is OK: OK - Certificate crm2001.codfw.wmnet will expire on Mon 16 Feb 2026 03:10:00 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting [15:45:56] (03CR) 10Santiago Faci: [C:03+1] Rename any reference of mpic into test-kitchen [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228522 (https://phabricator.wikimedia.org/T407808) (owner: 10Brouberol) [15:46:12] (03CR) 10Santiago Faci: [C:03+1] Define the test-kitchen-next namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228528 (https://phabricator.wikimedia.org/T407808) (owner: 10Brouberol) [15:46:58] (03CR) 10Brouberol: [C:03+2] Definition of the test-kitchen chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228521 (https://phabricator.wikimedia.org/T407808) (owner: 10Brouberol) [15:47:00] (03CR) 10Brouberol: [C:03+2] Rename any reference of mpic into test-kitchen [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228522 (https://phabricator.wikimedia.org/T407808) (owner: 10Brouberol) [15:47:04] (03CR) 10Brouberol: [C:03+2] Define the test-kitchen-next namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228528 (https://phabricator.wikimedia.org/T407808) (owner: 10Brouberol) [15:48:35] (03Merged) 10jenkins-bot: Definition of the test-kitchen chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228521 (https://phabricator.wikimedia.org/T407808) (owner: 10Brouberol) [15:48:52] (03CR) 10Santiago Faci: Define the test-kitchen-next service (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228529 (https://phabricator.wikimedia.org/T407808) (owner: 10Brouberol) [15:48:54] (03Merged) 10jenkins-bot: Rename any reference of mpic into test-kitchen [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228522 (https://phabricator.wikimedia.org/T407808) (owner: 10Brouberol) [15:49:38] (03CR) 10Brouberol: Define the test-kitchen-next service (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228529 (https://phabricator.wikimedia.org/T407808) (owner: 10Brouberol) [15:50:31] 07sre-alert-triage, 
06Infrastructure-Foundations: Alert in need of triage: PuppetConstantChange (instance cloudidp2001-dev:9100) - https://phabricator.wikimedia.org/T414968#11534452 (10LSobanski) p:05Triage→03Low a:03SLyngshede-WMF [15:50:48] (03PS2) 10Brouberol: Define the test-kitchen-next service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228529 (https://phabricator.wikimedia.org/T407808) [15:50:48] (03PS2) 10Brouberol: Define the test-kitchen namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228530 (https://phabricator.wikimedia.org/T407808) [15:50:48] (03PS2) 10Brouberol: Define the test-kitchen service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228531 (https://phabricator.wikimedia.org/T407808) [15:51:02] (03CR) 10Brouberol: Define the test-kitchen-next service (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228529 (https://phabricator.wikimedia.org/T407808) (owner: 10Brouberol) [15:53:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87757 and previous config saved to /var/cache/conftool/dbconfig/20260119-155324-marostegui.json [15:53:30] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1218.eqiad.wmnet with reason: Maintenance [15:53:31] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [15:53:32] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [15:53:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1218 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87758 and previous config saved to /var/cache/conftool/dbconfig/20260119-155338-marostegui.json [15:54:26] (03Merged) 10jenkins-bot: Define the test-kitchen-next namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228528 
(https://phabricator.wikimedia.org/T407808) (owner: 10Brouberol) [15:58:11] (03CR) 10Santiago Faci: [C:03+1] Define the test-kitchen-next service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228529 (https://phabricator.wikimedia.org/T407808) (owner: 10Brouberol) [15:58:18] (03CR) 10Brouberol: [C:03+2] Define the test-kitchen-next service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228529 (https://phabricator.wikimedia.org/T407808) (owner: 10Brouberol) [15:59:06] (03CR) 10Cathal Mooney: [C:03+1] "I don't believe this will cause much of a problem for us. Slight rise in queries hitting us perhaps but not much. As discussed if plan i" [dns] - 10https://gerrit.wikimedia.org/r/1226904 (https://phabricator.wikimedia.org/T81605) (owner: 10Ssingh) [16:00:06] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [16:00:30] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [16:04:06] 06SRE, 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06serviceops: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11534500 (10elukey) I was able to clean up the whole bucket with recursive calls in few minutes, meanwhile the other day I frequently got HTTP 504s. So poi... 
[16:06:13] (03CR) 10BCornwall: prometheus: add depooled cp* host check (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) (owner: 10CDobbins) [16:06:24] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/test-kitchen-next: apply [16:06:30] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/test-kitchen-next: apply [16:07:48] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/test-kitchen-next: apply [16:08:10] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/test-kitchen-next: apply [16:08:22] 06SRE, 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06serviceops: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#11534522 (10elukey) Tried to push and pull one image, super fast: ` elukey@build2002:~$ sudo docker push registry1004.eqiad.wmnet:5002/calico/typha Using... [16:09:04] (03CR) 10Phuedx: [C:03+1] trafficserver: Send /ins-502b/v2/events to intake-analytics [puppet] - 10https://gerrit.wikimedia.org/r/1218817 (https://phabricator.wikimedia.org/T412863) (owner: 10Milimetric) [16:10:14] (03CR) 10BCornwall: [C:03+2] DNSRepository: Automated MarkMonitor domain sync [dns] - 10https://gerrit.wikimedia.org/r/1225043 (owner: 10Ncmonitor) [16:11:07] !log brett@dns1006 START - running authdns-update [16:11:36] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
[16:12:13] !log brett@dns1006 END - running authdns-update [16:13:17] (03PS3) 10Brouberol: Define the test-kitchen namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228530 (https://phabricator.wikimedia.org/T407808) [16:13:18] (03PS3) 10Brouberol: Define the test-kitchen service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228531 (https://phabricator.wikimedia.org/T407808) [16:13:18] (03PS1) 10Brouberol: Fix duplicate discovery domain in ingress FQDNs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228542 (https://phabricator.wikimedia.org/T407808) [16:13:29] (03CR) 10CI reject: [V:04-1] Define the test-kitchen namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228530 (https://phabricator.wikimedia.org/T407808) (owner: 10Brouberol) [16:13:32] (03CR) 10CI reject: [V:04-1] Define the test-kitchen service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228531 (https://phabricator.wikimedia.org/T407808) (owner: 10Brouberol) [16:13:36] (03CR) 10CI reject: [V:04-1] Fix duplicate discovery domain in ingress FQDNs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228542 (https://phabricator.wikimedia.org/T407808) (owner: 10Brouberol) [16:13:43] (03PS2) 10Brouberol: Fix duplicate discovery domain in ingress FQDNs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228542 (https://phabricator.wikimedia.org/T407808) [16:14:07] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs2014 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h). 
https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [16:14:56] (03CR) 10Santiago Faci: [C:03+1] Fix duplicate discovery domain in ingress FQDNs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228542 (https://phabricator.wikimedia.org/T407808) (owner: 10Brouberol) [16:15:24] (03PS3) 10Brouberol: Fix duplicate discovery domain in ingress FQDNs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228542 (https://phabricator.wikimedia.org/T407808) [16:15:24] (03PS4) 10Brouberol: Define the test-kitchen namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228530 (https://phabricator.wikimedia.org/T407808) [16:15:24] (03PS4) 10Brouberol: Define the test-kitchen service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228531 (https://phabricator.wikimedia.org/T407808) [16:18:30] (03CR) 10Brouberol: [C:03+2] Fix duplicate discovery domain in ingress FQDNs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228542 (https://phabricator.wikimedia.org/T407808) (owner: 10Brouberol) [16:22:19] (03PS1) 10Federico Ceratto: charts: add generic webservice chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228543 (https://phabricator.wikimedia.org/T414112) [16:22:19] (03CR) 10Federico Ceratto: "As discussed on IRC :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228543 (https://phabricator.wikimedia.org/T414112) (owner: 10Federico Ceratto) [16:23:11] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226903 (https://phabricator.wikimedia.org/T412396) (owner: 10Gergő Tisza) [16:23:23] (03PS1) 10Federico Ceratto: helmfile.d: add linked-artifacts service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228544 (https://phabricator.wikimedia.org/T414112) [16:24:22] (03PS2) 10Federico Ceratto: charts: add generic webservice 
chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228543 (https://phabricator.wikimedia.org/T414112) [16:26:52] (03CR) 10Federico Ceratto: [C:03+2] admin/data: Shell, deployers, analytics-privatedata-users for kareid [puppet] - 10https://gerrit.wikimedia.org/r/1226854 (https://phabricator.wikimedia.org/T413364) (owner: 10JMeybohm) [16:30:04] jan_drewniak: gettimeofday() says it's time for Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260119T1630) [16:36:18] !log ammarpad@deploy2002 mwscript-k8s job started: extensions/Translate/scripts/moveTranslatableBundle.php --wiki=metawiki --reason 'Requested at [[phab:T414808]]' 'Celebrate Women/Events' 'Celebrate Women/Events/2025' Ammarpad # T414808 [16:36:22] T414808: Request to move translatable page: Celebrate Women/Events (2026) - https://phabricator.wikimedia.org/T414808 [16:37:19] (03PS1) 10Kevin Bazira: ml-services: rr-wikidata reduce memory usage [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228548 (https://phabricator.wikimedia.org/T414060) [16:39:28] (03PS1) 10Dpogorzelski: docker registry: Add ml-build user to regular push [puppet] - 10https://gerrit.wikimedia.org/r/1228549 [16:41:51] (03CR) 10Elukey: "@jmeybohm@wikimedia.org o/ Dawid tried to push from ml-build but he got an auth denied response:" [puppet] - 10https://gerrit.wikimedia.org/r/1228549 (owner: 10Dpogorzelski) [16:42:52] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users for kareid - https://phabricator.wikimedia.org/T413364#11534704 (10FCeratto-WMF) @KReid-WMF access configured - can you please confirm it works so we can close the task? 
Thanks [16:55:43] !log ammarpad@deploy2002 mwscript-k8s job started: refreshImageMetadata.php --wiki=commonswiki --mediatype=AUDIO --mime=audio/mid --force --start=Segne_du,_Maria.mid --end=Segne_du,_Maria.mid # T414642 [16:55:47] T414642: Run refreshMetadata --force for two broken midi files - https://phabricator.wikimedia.org/T414642 [16:58:51] (03CR) 10Dpogorzelski: [C:03+1] ml-services: rr-wikidata reduce memory usage [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228548 (https://phabricator.wikimedia.org/T414060) (owner: 10Kevin Bazira) [16:59:41] (03CR) 10Kevin Bazira: [C:03+2] ml-services: rr-wikidata reduce memory usage [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228548 (https://phabricator.wikimedia.org/T414060) (owner: 10Kevin Bazira) [17:01:36] (03Merged) 10jenkins-bot: ml-services: rr-wikidata reduce memory usage [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228548 (https://phabricator.wikimedia.org/T414060) (owner: 10Kevin Bazira) [17:02:45] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [17:03:02] !log kevinbazira@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [17:04:14] 06SRE, 10LDAP-Access-Requests, 10Phabricator: undisable vanderwaalforces in phabricator and ldap - https://phabricator.wikimedia.org/T414774#11534780 (10taavi) 05Open→03Resolved a:03taavi [17:06:37] 06SRE, 10LDAP-Access-Requests, 10Phabricator: undisable vanderwaalforces in phabricator and ldap - https://phabricator.wikimedia.org/T414774#11534784 (10Novem_Linguae) [17:07:47] 06SRE, 10LDAP-Access-Requests, 10Phabricator: undisable vanderwaalforces in phabricator and ldap - https://phabricator.wikimedia.org/T414774#11534785 (10Vanderwaalforces) Thank you so much, Taavi, Novem, Aklapper, everyone for your help and support! 
[17:07:56] (03PS1) 10Gergő Tisza: Add WikimediaCustomizations to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1228555 (https://phabricator.wikimedia.org/T410515) [17:08:23] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1228555 (https://phabricator.wikimedia.org/T410515) (owner: 10Gergő Tisza) [17:08:29] !log ammarpad@deploy2002 mwscript-k8s job started: refreshImageMetadata.php --wiki=commonswiki --mediatype=AUDIO --mime=audio/midi --force --start=Honeysuckle_Rose_for_wikipedia.mid --end=Honeysuckle_Rose_for_wikipedia.mid # T414642 [17:08:34] T414642: Run refreshMetadata --force for two broken midi files - https://phabricator.wikimedia.org/T414642 [17:15:12] (03CR) 10Kamila Součková: [C:03+1] wikikube: Add ratelimit-media namespace [puppet] - 10https://gerrit.wikimedia.org/r/1226797 (https://phabricator.wikimedia.org/T414439) (owner: 10Clément Goubert) [17:19:03] (03CR) 10Clément Goubert: [C:03+2] wikikube: Add ratelimit-media namespace [puppet] - 10https://gerrit.wikimedia.org/r/1226797 (https://phabricator.wikimedia.org/T414439) (owner: 10Clément Goubert) [17:19:12] FIRING: [2x] CertAlmostExpired: Certificate for service opensearch-ipoid:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#opensearch-ipoid:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:22:20] (03CR) 10Codename Noreste: [C:03+1] enwikiquote: Add autopatroller protection option [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1227493 (https://phabricator.wikimedia.org/T414711) (owner: 10Seawolf35gerrit) [17:22:39] (03PS1) 10Muehlenhoff: DNS: Enable Bird 2.18 for ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1228559 (https://phabricator.wikimedia.org/T413740) [17:22:41] (03PS1) 10Muehlenhoff: DNS: Enable Bird 2.18 for all sites 
[puppet] - 10https://gerrit.wikimedia.org/r/1228560 (https://phabricator.wikimedia.org/T413740) [17:24:12] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [17:31:21] PROBLEM - SSH on bast3007 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:32:21] RECOVERY - SSH on bast3007 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:49:41] (03PS1) 10Fabfur: cache::upload: enable global ratelimiting (magru) [puppet] - 10https://gerrit.wikimedia.org/r/1228563 (https://phabricator.wikimedia.org/T406545) [17:58:32] !log sudo systemctl restart pybal.service on lvs2014: T414940 [17:58:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:37] T414940: Handle httpd log surplus coming from Liberica - https://phabricator.wikimedia.org/T414940 [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260119T1800) [18:00:06] ryankemper: How many deployers does it take to do Wikidata Query Service weekly deploy deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260119T1800). 
[18:01:41] (03CR) 10Ssingh: [C:03+1] "https://puppet-compiler.wmflabs.org/output/1228559/7909/" [puppet] - 10https://gerrit.wikimedia.org/r/1228559 (https://phabricator.wikimedia.org/T413740) (owner: 10Muehlenhoff) [18:01:59] (03CR) 10Ssingh: [C:03+1] "(will review the other one once ulsfo is done)" [puppet] - 10https://gerrit.wikimedia.org/r/1228559 (https://phabricator.wikimedia.org/T413740) (owner: 10Muehlenhoff) [18:02:40] (03CR) 10Vgutierrez: [V:03+1] "VTCs are happy (tested against cp7016)" [puppet] - 10https://gerrit.wikimedia.org/r/1228563 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [18:04:07] RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs2014 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [18:07:39] (03CR) 10Shivaansh Singh: Add Comments namespace for shnwikinews (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226024 (https://phabricator.wikimedia.org/T414403) (owner: 10Shivaansh Singh) [18:10:14] (03CR) 10Kamila Součková: [C:03+1] Add ratelimit-media namespace to wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226798 (https://phabricator.wikimedia.org/T414439) (owner: 10Clément Goubert) [18:10:51] (03PS2) 10Shivaansh Singh: Add Comments namespace for shnwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226024 (https://phabricator.wikimedia.org/T414403) [18:12:59] PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 100% [18:13:46] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226024 (https://phabricator.wikimedia.org/T414403) (owner: 10Shivaansh Singh) [18:15:08] FIRING: [2x] ProbeDown: Service titan1002:443 has failed probes 
(http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:16:41] RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms [18:19:12] RESOLVED: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:22:04] 06SRE, 06Traffic, 13Patch-For-Review: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11534944 (10ssingh) @cmooney: Per the discussion above with Arzhel, we think that `2a02:ec80:53::1/128` is better for readability and consistency with over v6 records, than the current `2a02:ec80:... [18:22:38] (03CR) 10Vgutierrez: [V:03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1228563 (https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [18:26:43] (03PS1) 10Fabfur: cache::upload: enable global ratelimiting (ulsfo) [puppet] - 10https://gerrit.wikimedia.org/r/1228568 (https://phabricator.wikimedia.org/T406545) [18:28:23] !log ammarpad@deploy2002 mwscript-k8s job started: extensions/Translate/scripts/moveTranslatableBundle.php --wiki=mediawikiwiki --reason 'Requested at [[phab:T414529]]' --skip-redirect 'Extension:DynamicPageList3 ' Extension:DynamicPageList4 Ammarpad # T414529 [18:28:28] T414529: Migrate translations from DPL3 to DPL4 - https://phabricator.wikimedia.org/T414529 [18:34:50] (03PS1) 10Fabfur: cache::upload: enable global ratelimiting (eqsin) [puppet] - 10https://gerrit.wikimedia.org/r/1228571 (https://phabricator.wikimedia.org/T406545) [18:34:52] (03PS1) 10Fabfur: cache::upload: enable global ratelimiting (codfw) [puppet] - 
10https://gerrit.wikimedia.org/r/1228572 (https://phabricator.wikimedia.org/T406545) [18:34:53] (03PS1) 10Fabfur: cache::upload: enable global ratelimiting (drmrs) [puppet] - 10https://gerrit.wikimedia.org/r/1228573 (https://phabricator.wikimedia.org/T406545) [18:34:55] (03PS1) 10Fabfur: cache::upload: enable global ratelimiting (eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/1228574 (https://phabricator.wikimedia.org/T406545) [18:34:57] (03PS1) 10Fabfur: cache::upload: enable global ratelimiting (esams) [puppet] - 10https://gerrit.wikimedia.org/r/1228575 (https://phabricator.wikimedia.org/T406545) [18:35:50] (03PS2) 10Fabfur: cache::upload: enable global ratelimiting (magru) [puppet] - 10https://gerrit.wikimedia.org/r/1228563 (https://phabricator.wikimedia.org/T406545) [18:35:59] (03PS2) 10Fabfur: cache::upload: enable global ratelimiting (ulsfo) [puppet] - 10https://gerrit.wikimedia.org/r/1228568 (https://phabricator.wikimedia.org/T406545) [18:36:06] (03PS2) 10Fabfur: cache::upload: enable global ratelimiting (eqsin) [puppet] - 10https://gerrit.wikimedia.org/r/1228571 (https://phabricator.wikimedia.org/T406545) [18:36:13] (03PS2) 10Fabfur: cache::upload: enable global ratelimiting (codfw) [puppet] - 10https://gerrit.wikimedia.org/r/1228572 (https://phabricator.wikimedia.org/T406545) [18:36:20] (03PS2) 10Fabfur: cache::upload: enable global ratelimiting (drmrs) [puppet] - 10https://gerrit.wikimedia.org/r/1228573 (https://phabricator.wikimedia.org/T406545) [18:36:27] (03PS2) 10Fabfur: cache::upload: enable global ratelimiting (eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/1228574 (https://phabricator.wikimedia.org/T406545) [18:36:33] (03PS2) 10Fabfur: cache::upload: enable global ratelimiting (esams) [puppet] - 10https://gerrit.wikimedia.org/r/1228575 (https://phabricator.wikimedia.org/T406545) [18:43:06] (03CR) 10Vgutierrez: cache::upload: enable global ratelimiting (magru) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1228563 
(https://phabricator.wikimedia.org/T406545) (owner: 10Fabfur) [18:49:04] (03CR) 10Pmiazga: [C:03+1] "LGTM, tested locally - works as expected. The only thing I could pick on is the `enable_x_ratelimit_headers` config variable name. In our " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1224937 (https://phabricator.wikimedia.org/T405636) (owner: 10Daniel Kinzler) [18:54:41] (03PS1) 10Cathal Mooney: Add ns2.wikimedia.org anycast block to anycast config [homer/public] - 10https://gerrit.wikimedia.org/r/1228576 (https://phabricator.wikimedia.org/T81605) [18:56:37] (03CR) 10Pmiazga: [C:04-1] "Tested, works as expected, altough I'm wondering about the response content-type." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226827 (https://phabricator.wikimedia.org/T405636) (owner: 10Daniel Kinzler) [18:58:53] 06SRE, 06Traffic, 13Patch-For-Review: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11535000 (10cmooney) >>! In T81605#11534944, @ssingh wrote: > @cmooney: Per the discussion above with Arzhel, we think that `2a02:ec80:53::1/128` is better for readability and consistency with oth... [19:01:20] 06SRE, 06Traffic, 13Patch-For-Review: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11535001 (10ssingh) Many thanks @cmooney 🙏! I will go ahead with `2620:0:860:53::1/128` for `ns1` and update that everywhere in the current CRs. 
[19:08:10] FIRING: BFDdown: BFD session down between cr2-eqdfw and fe80::7a4f:9b00:174e:7c0c - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [19:13:10] RESOLVED: BFDdown: BFD session down between cr2-eqdfw and fe80::7a4f:9b00:174e:7c0c - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [19:19:12] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [19:23:05] (03CR) 10Cathal Mooney: [C:03+1] plugins/wmf-netbox: remove ipv4 only for DNS hosts BGP [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1228518 (https://phabricator.wikimedia.org/T81605) (owner: 10Ssingh) [19:25:39] (03CR) 10Ssingh: [C:03+1] Add ns2.wikimedia.org anycast block to anycast config [homer/public] - 10https://gerrit.wikimedia.org/r/1228576 (https://phabricator.wikimedia.org/T81605) (owner: 10Cathal Mooney) [19:25:42] (03CR) 10Ssingh: [C:03+1] "Thanks!" 
[homer/public] - 10https://gerrit.wikimedia.org/r/1228576 (https://phabricator.wikimedia.org/T81605) (owner: 10Cathal Mooney) [19:29:31] (03PS1) 10Alexandros Kosiaris: base::sysctl: Use modern way of fact addressing [puppet] - 10https://gerrit.wikimedia.org/r/1228580 [19:29:32] (03PS1) 10Alexandros Kosiaris: profile::base: Remove a superfluous $::site check [puppet] - 10https://gerrit.wikimedia.org/r/1228581 [19:29:32] (03PS1) 10Alexandros Kosiaris: base::sysctl: Allow more finegrained rp_filter behavior [puppet] - 10https://gerrit.wikimedia.org/r/1228582 (https://phabricator.wikimedia.org/T352956) [19:29:33] (03PS1) 10Alexandros Kosiaris: base::sysctl: Switch priority of the ubuntu-defaults stanza [puppet] - 10https://gerrit.wikimedia.org/r/1228583 (https://phabricator.wikimedia.org/T352956) [19:32:51] (03CR) 10CI reject: [V:04-1] base::sysctl: Allow more finegrained rp_filter behavior [puppet] - 10https://gerrit.wikimedia.org/r/1228582 (https://phabricator.wikimedia.org/T352956) (owner: 10Alexandros Kosiaris) [19:33:32] (03PS10) 10Ssingh: Adjust CSP header for pdfs & videos & set enforce on testwiki [puppet] - 10https://gerrit.wikimedia.org/r/547929 (https://phabricator.wikimedia.org/T117618) (owner: 10Brian Wolff) [19:34:01] (03CR) 10CI reject: [V:04-1] base::sysctl: Switch priority of the ubuntu-defaults stanza [puppet] - 10https://gerrit.wikimedia.org/r/1228583 (https://phabricator.wikimedia.org/T352956) (owner: 10Alexandros Kosiaris) [19:34:12] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:35:21] (03PS32) 10Ssingh: varnish: Add restrictive CSP to upload.wikimedia.org for testwiki only [puppet] - 10https://gerrit.wikimedia.org/r/1059423 (https://phabricator.wikimedia.org/T117618) (owner: 10CDobbins) [19:37:21] (03PS2) 
10Cathal Mooney: Add config for authdns IPv6 public IPs [homer/public] - 10https://gerrit.wikimedia.org/r/1228576 (https://phabricator.wikimedia.org/T81605) [19:39:17] (03CR) 10BCornwall: [C:03+1] wikimedia/wikipedia.org: match TTLs for NS and glue records [dns] - 10https://gerrit.wikimedia.org/r/1226904 (https://phabricator.wikimedia.org/T81605) (owner: 10Ssingh) [19:40:35] (03CR) 10Ssingh: "./docker_run.sh cp1101.eqiad.wmnet 1059423" [puppet] - 10https://gerrit.wikimedia.org/r/1059423 (https://phabricator.wikimedia.org/T117618) (owner: 10CDobbins) [19:40:50] (03PS1) 10Gergő Tisza: Enable WikimediaCustomizations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1228586 (https://phabricator.wikimedia.org/T410515) [19:40:56] (03CR) 10Ssingh: [C:03+1] "Verified the /128s in common.yaml." [homer/public] - 10https://gerrit.wikimedia.org/r/1228576 (https://phabricator.wikimedia.org/T81605) (owner: 10Cathal Mooney) [19:41:23] (03CR) 10Ssingh: [C:03+1] Add config for authdns IPv6 public IPs (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1228576 (https://phabricator.wikimedia.org/T81605) (owner: 10Cathal Mooney) [19:41:38] (03PS2) 10Alexandros Kosiaris: base::sysctl: Allow more finegrained rp_filter behavior [puppet] - 10https://gerrit.wikimedia.org/r/1228582 (https://phabricator.wikimedia.org/T352956) [19:41:38] (03PS2) 10Alexandros Kosiaris: base::sysctl: Switch priority of the ubuntu-defaults stanza [puppet] - 10https://gerrit.wikimedia.org/r/1228583 (https://phabricator.wikimedia.org/T352956) [19:43:39] 06SRE, 06Traffic, 13Patch-For-Review: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11535037 (10cmooney) As discussed I think a good way to bring this live might be: # Update the puppet repo to make the authdns boxes announce the new IPs at all sites # Merge the patch to enable... 
[19:43:57] (03CR) 10CI reject: [V:04-1] base::sysctl: Allow more finegrained rp_filter behavior [puppet] - 10https://gerrit.wikimedia.org/r/1228582 (https://phabricator.wikimedia.org/T352956) (owner: 10Alexandros Kosiaris) [19:44:25] (03CR) 10CI reject: [V:04-1] base::sysctl: Switch priority of the ubuntu-defaults stanza [puppet] - 10https://gerrit.wikimedia.org/r/1228583 (https://phabricator.wikimedia.org/T352956) (owner: 10Alexandros Kosiaris) [19:46:36] (03CR) 10Cathal Mooney: Add config for authdns IPv6 public IPs (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1228576 (https://phabricator.wikimedia.org/T81605) (owner: 10Cathal Mooney) [19:47:06] (03PS3) 10Alexandros Kosiaris: base::sysctl: Allow more finegrained rp_filter behavior [puppet] - 10https://gerrit.wikimedia.org/r/1228582 (https://phabricator.wikimedia.org/T352956) [19:47:06] (03PS3) 10Alexandros Kosiaris: base::sysctl: Switch priority of the ubuntu-defaults stanza [puppet] - 10https://gerrit.wikimedia.org/r/1228583 (https://phabricator.wikimedia.org/T352956) [19:52:42] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1228583 (https://phabricator.wikimedia.org/T352956) (owner: 10Alexandros Kosiaris) [19:52:47] (03CR) 10Muehlenhoff: [C:03+1] "Looks good, two comments inline" [puppet] - 10https://gerrit.wikimedia.org/r/1228580 (owner: 10Alexandros Kosiaris) [19:54:58] (03PS6) 10Ssingh: dnsbox: advertise ns[0-2] IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1226928 (https://phabricator.wikimedia.org/T81605) [19:56:37] (03CR) 10Ssingh: [V:03+1 C:04-2] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7913/co" [puppet] - 10https://gerrit.wikimedia.org/r/1226928 (https://phabricator.wikimedia.org/T81605) (owner: 10Ssingh) [20:02:01] 06SRE, 06Traffic, 13Patch-For-Review: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11535054 
(10ssingh) >>! In T81605#11535037, @cmooney wrote: > As discussed I think a good way to bring this live might be: >[...] Sounds like a plan and it makes sense -- we can test everything o... [20:19:15] (03PS1) 10BCornwall: Revert "DNSRepository: Automated MarkMonitor domain sync" [dns] - 10https://gerrit.wikimedia.org/r/1228592 [20:19:28] (03CR) 10BCornwall: [V:03+2 C:03+2] Revert "DNSRepository: Automated MarkMonitor domain sync" [dns] - 10https://gerrit.wikimedia.org/r/1228592 (owner: 10BCornwall) [20:19:34] (03CR) 10Cathal Mooney: [C:03+1] "LGTM, all IPs match etc." [puppet] - 10https://gerrit.wikimedia.org/r/1226928 (https://phabricator.wikimedia.org/T81605) (owner: 10Ssingh) [20:20:04] !log brett@dns1006 START - running authdns-update [20:20:56] (03Abandoned) 10BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1225044 (owner: 10Ncmonitor) [20:20:59] (03Abandoned) 10BCornwall: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1225045 (owner: 10Ncmonitor) [20:21:11] !log brett@dns1006 END - running authdns-update [20:24:43] (03PS1) 10BCornwall: ncmonitor: Ignore game show domains [puppet] - 10https://gerrit.wikimedia.org/r/1228593 [20:31:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87759 and previous config saved to /var/cache/conftool/dbconfig/20260119-203104-marostegui.json [20:31:10] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [20:31:11] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [20:32:00] (03CR) 10Pppery: [C:03+1] ncmonitor: Ignore game show domains [puppet] - 10https://gerrit.wikimedia.org/r/1228593 (owner: 10BCornwall) [20:34:58] (03CR) 10Kamila Součková: ratelimit-media: Initial service deployment (031 comment) 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1226814 (https://phabricator.wikimedia.org/T414439) (owner: 10Clément Goubert) [20:41:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P87760 and previous config saved to /var/cache/conftool/dbconfig/20260119-204112-marostegui.json [20:47:22] (03CR) 10Kamila Součková: [C:03+1] api-gateway: Add external services support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1225548 (https://phabricator.wikimedia.org/T414333) (owner: 10Clément Goubert) [20:50:51] (03CR) 10Pppery: "You seem to have overwritten this with a completely unrelated change. What?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226024 (https://phabricator.wikimedia.org/T414403) (owner: 10Shivaansh Singh) [20:51:11] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, January 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226366 (https://phabricator.wikimedia.org/T413592) (owner: 10Pppery) [20:51:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P87761 and previous config saved to /var/cache/conftool/dbconfig/20260119-205120-marostegui.json [20:54:27] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Platform-SRE: Grant Access to analytics-privatedata-users for hmonroy - https://phabricator.wikimedia.org/T414375#11535094 (10Ottomata) 05Open→03Invalid @HMonroy, [[ https://wikitech.wikimedia.org/wiki/Data_Platform/Data_Lake/Edits/MediaWiki_... [20:57:33] PROBLEM - Host stat1008 is DOWN: PING CRITICAL - Packet loss = 100% [21:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260119T2100). [21:00:05] Seawolf35, tgr, Shivaansh, and Pppery: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:08] here [21:00:13] o/ [21:01:28] o/ [21:01:29] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87762 and previous config saved to /var/cache/conftool/dbconfig/20260119-210128-marostegui.json [21:01:35] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [21:01:35] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [21:01:45] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2188.codfw.wmnet with reason: Maintenance [21:01:54] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2188 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P87763 and previous config saved to /var/cache/conftool/dbconfig/20260119-210153-marostegui.json [21:02:04] I'll retract the two extension deployment patches though, seems like I'll need to add the extension to twn.net first [21:03:40] (the one patch, because I forgot to add the other one) [21:06:15] anyways I can deploy [21:07:14] I'll need a deployer [21:07:22] likewise [21:07:30] anyone needing their patch to go out separately? 
[21:08:47] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1226024 seems very wrong [21:09:00] Indeed [21:09:50] The first version of the patch was reasonable - I left some feedback on it and then pointed out the existence of the deployment process since the fact that config patches don't get reviewed without being scheduled is not communicated well at all, and then they turned it into something else entirely [21:09:54] And also they aren't here [21:10:18] Probably a botched attempt to create a different patch for a different task? [21:10:31] anyway I'm removing it from this window [21:13:01] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1227493 (https://phabricator.wikimedia.org/T414711) (owner: 10Seawolf35gerrit) [21:13:01] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226903 (https://phabricator.wikimedia.org/T412396) (owner: 10Gergő Tisza) [21:13:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226366 (https://phabricator.wikimedia.org/T413592) (owner: 10Pppery) [21:13:56] (03Merged) 10jenkins-bot: enwikiquote: Add autopatroller protection option [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1227493 (https://phabricator.wikimedia.org/T414711) (owner: 10Seawolf35gerrit) [21:14:00] (03Merged) 10jenkins-bot: debug: Add X-Provenance header to Logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226903 (https://phabricator.wikimedia.org/T412396) (owner: 10Gergő Tisza) [21:14:03] (03Merged) 10jenkins-bot: Urwikiquote: restore flipped icon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1226366 (https://phabricator.wikimedia.org/T413592) (owner: 10Pppery) [21:14:22] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1227493|enwikiquote: Add 
autopatroller protection option (T414711)]], [[gerrit:1226903|debug: Add X-Provenance header to Logstash (T412396)]], [[gerrit:1226366|Urwikiquote: restore flipped icon (T413592)]] [21:14:30] T414711: Add an autopatroller protection level to English Wikiquote - https://phabricator.wikimedia.org/T414711 [21:14:30] T412396: Pass through information about the client from the CDN to MediaWiki to Logstash - https://phabricator.wikimedia.org/T412396 [21:14:30] T413592: Urdu Wikiquote update wordmark - https://phabricator.wikimedia.org/T413592 [21:16:22] !log tgr@deploy2002 seawolf35gerrit, pppery, tgr: Backport for [[gerrit:1227493|enwikiquote: Add autopatroller protection option (T414711)]], [[gerrit:1226903|debug: Add X-Provenance header to Logstash (T412396)]], [[gerrit:1226366|Urwikiquote: restore flipped icon (T413592)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:16:50] Mine looks good [21:16:59] Mine lgtm [21:19:12] FIRING: [2x] CertAlmostExpired: Certificate for service opensearch-ipoid:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#opensearch-ipoid:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [21:20:11] !log tgr@deploy2002 seawolf35gerrit, pppery, tgr: Continuing with sync [21:22:52] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [21:23:39] FIRING: [3x] CoreBGPDown: Core BGP session down between cr2-eqord and cr3-ulsfo (198.35.26.128) - group Confed_ulsfo - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [21:24:12] FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running 
calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [21:24:20] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1227493|enwikiquote: Add autopatroller protection option (T414711)]], [[gerrit:1226903|debug: Add X-Provenance header to Logstash (T412396)]], [[gerrit:1226366|Urwikiquote: restore flipped icon (T413592)]] (duration: 09m 58s) [21:24:28] T414711: Add an autopatroller protection level to English Wikiquote - https://phabricator.wikimedia.org/T414711 [21:24:28] T412396: Pass through information about the client from the CDN to MediaWiki to Logstash - https://phabricator.wikimedia.org/T412396 [21:24:28] T413592: Urdu Wikiquote update wordmark - https://phabricator.wikimedia.org/T413592 [21:25:00] tgr_ ty! [21:27:52] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [21:28:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [21:28:49] PROBLEM - SSH on titan1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:28:49] PROBLEM - Memcached on titan1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Memcached [21:29:12] FIRING: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:29:39] RECOVERY - Memcached on titan1002 is OK: TCP OK - 0.008 second response time on 10.64.48.167 port 11211 https://wikitech.wikimedia.org/wiki/Memcached [21:29:39] RECOVERY - SSH on titan1002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:32:52] RESOLVED: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [21:33:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [21:34:12] RESOLVED: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:35:31] (03CR) 10Santiago Faci: [C:03+1] Define the test-kitchen namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228530 (https://phabricator.wikimedia.org/T407808) (owner: 10Brouberol) [21:35:43] (03CR) 10Santiago Faci: Define the test-kitchen service (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228531 (https://phabricator.wikimedia.org/T407808) (owner: 10Brouberol) [21:56:15] (03PS1) 10Aqu: Allow connections to eventgates from Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228599 (https://phabricator.wikimedia.org/T411989) 
[21:58:18] (03PS2) 10Aqu: Allow connections to eventgates from Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228599 (https://phabricator.wikimedia.org/T411989) [22:00:04] Reedy, sbassett, Maryum, and manfredi: #bothumor I � Unicode. All rise for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260119T2200). [22:00:32] 06SRE, 10MediaWiki-Debug-Logger, 06Traffic, 06MediaWiki-Platform-Team (Q3 Kanban Board): Pass through information about the client from the CDN to MediaWiki to Logstash - https://phabricator.wikimedia.org/T412396#11535200 (10Tgr) Code-wise this is done. Should probably update some dashboards. [23:04:31] 06SRE, 10SRE-Access-Requests: Requesting access to SRE/production access for Kim.pham (kimpham in phab) - https://phabricator.wikimedia.org/T414671#11535244 (10thcipriani) Approved for deployment. @kimpham for backports, you should request `spiderpig-access` via https://idm.wikimedia.org/permissions/ to use t... [23:19:12] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [23:34:12] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable