[00:00:45] (CR) Dzahn: [C:+2] "[cumin2002:~] $ sudo cumin 'tcp-*' "ip addr show dev lo | grep global"" [puppet] - https://gerrit.wikimedia.org/r/1215240 (owner: CDanis)
[00:01:48] RECOVERY - MD RAID on ganeti1039 is OK: OK: Active: 12, Working: 12, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[00:04:04] SRE, collaboration-services, Traffic, Release-Engineering-Team (Radar): Deploy a TCP proxy across all DCs - https://phabricator.wikimedia.org/T408532#11435137 (Dzahn) This change should have been linked here. https://gerrit.wikimedia.org/r/c/operations/puppet/+/1215240 (thanks cdanis!) It added...
[00:05:31] SRE, collaboration-services, Traffic, Release-Engineering-Team (Radar): Deploy a TCP proxy across all DCs - https://phabricator.wikimedia.org/T408532#11435160 (Dzahn) This should conclude the box: ` Prepare tcpproxy VMs for accepting traffic on the new public IPs ` on the parent task "Move Ge...
[00:06:50] SRE, collaboration-services, Traffic, Release-Engineering-Team (Radar): Deploy a TCP proxy across all DCs - https://phabricator.wikimedia.org/T408532#11435162 (Dzahn) In progress→Resolved from here on anything would be just updating 2 tickets at a time. This is done and if there are s...
[00:06:58] (CR) Jasmine: [C:+2] admin: Add jasmine FIDO ssh key [puppet] - https://gerrit.wikimedia.org/r/1213588 (owner: Jasmine)
[00:14:41] FIRING: [14x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_gerrit-ssh.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[00:18:33] (CR) Aklapper: [V:+2 C:+2] Replace "libphutil" with "Arcanist" [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1214708 (owner: Pppery)
[00:20:21] (CR) Aklapper: "I think this should also have a line "defaultbranch=wmf/stable"." [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1214702 (owner: Pppery)
[00:20:53] (CR) Aklapper: [V:+2 C:+2] Remove old list of translated languages [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1214701 (owner: Pppery)
[00:26:18] !log ladsgroup@deploy2002:~$ mwscript-k8s --follow -- findBadBlobs.php --wiki guwiktionary --mark "Corrupted UTF-8 (T351953)" --revisions 20576
[00:26:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:26:22] T351953: Various old revisions are encoded as Windows-1252 rather than UTF-8, causing "RuntimeException: PCRE failure" when viewing them - https://phabricator.wikimedia.org/T351953
[00:26:50] (CR) Aklapper: [V:+2 C:+2] Update source strings to latest upstream [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1206983 (owner: Pppery)
[00:27:35] !log ladsgroup@deploy2002:~$ mwscript-k8s --follow -- findBadBlobs.php --wiki huwikiquote --mark "Corrupted UTF-8 (T351953)" --revisions 3804,3808,3811,3813,3814,3818,3825
[00:27:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:32:54] (PS1) Dzahn: varnish: remove ancient Noise rule from text-frontend VCL [puppet] -
https://gerrit.wikimedia.org/r/1215329
[00:40:12] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1215332
[00:40:12] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1215332 (owner: TrainBranchBot)
[00:43:38] (CR) Pppery: "`track=1` means to apply patches to the remote and branch that your local copy is tracking. So you shouldn't need defaultbranch or default" [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1214702 (owner: Pppery)
[00:51:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 19.91% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[00:52:04] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1215332 (owner: TrainBranchBot)
[00:56:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:00:52] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[01:10:31] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1215339
[01:10:31] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1215339
(owner: TrainBranchBot)
[01:13:59] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 13m 06s)
[01:15:38] (CR) Aklapper: [V:+2 C:+2] "Ah, learned something. :D Thanks!" [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1214702 (owner: Pppery)
[01:20:12] FIRING: HelmReleaseBadStatus: Helm release mw-script/utk6lsuw on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[01:32:14] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1215339 (owner: TrainBranchBot)
[01:40:51] (PS1) RLazarus: Update to v1.35.7 [debs/envoyproxy] (v1.35) - https://gerrit.wikimedia.org/r/1215349 (https://phabricator.wikimedia.org/T410975)
[01:42:06] (CR) RLazarus: [C:+2] Update to v1.35.7 [debs/envoyproxy] (v1.35) - https://gerrit.wikimedia.org/r/1215349 (https://phabricator.wikimedia.org/T410975) (owner: RLazarus)
[01:59:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:01:06] !log rzl@apt1002:~$ sudo -i reprepro -C component/envoy-future include bullseye-wikimedia /home/rzl/envoyproxy_1.35.7-1_amd64.changes
[02:01:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:10:30] (PS1) RLazarus: envoy-future: Update to v1.35.7 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1215363 (https://phabricator.wikimedia.org/T410975)
[02:11:06] (PS2) RLazarus: envoy-future: Update to v1.35.7 [docker-images/production-images] -
https://gerrit.wikimedia.org/r/1215363 (https://phabricator.wikimedia.org/T410975)
[02:11:24] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T410589)', diff saved to https://phabricator.wikimedia.org/P86413 and previous config saved to /var/cache/conftool/dbconfig/20251205-021123-ladsgroup.json
[02:11:28] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[02:12:06] (CR) RLazarus: [V:+2] "`" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1215363 (https://phabricator.wikimedia.org/T410975) (owner: RLazarus)
[02:13:44] (CR) Scott French: [C:+1] envoy-future: Update to v1.35.7 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1215363 (https://phabricator.wikimedia.org/T410975) (owner: RLazarus)
[02:14:17] (CR) RLazarus: [V:+2 C:+2] envoy-future: Update to v1.35.7 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1215363 (https://phabricator.wikimedia.org/T410975) (owner: RLazarus)
[02:26:32] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P86414 and previous config saved to /var/cache/conftool/dbconfig/20251205-022631-ladsgroup.json
[02:37:07] (PS1) Papaul: Add my FIDO backed production SSH key [puppet] - https://gerrit.wikimedia.org/r/1215373 (https://phabricator.wikimedia.org/T411833)
[02:37:52] (CR) CI reject: [V:-1] Add my FIDO backed production SSH key [puppet] - https://gerrit.wikimedia.org/r/1215373 (https://phabricator.wikimedia.org/T411833) (owner: Papaul)
[02:41:40] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P86415 and previous config saved to /var/cache/conftool/dbconfig/20251205-024139-ladsgroup.json
[02:44:36] (PS2) Papaul: Add my FIDO backed production SSH key [puppet] -
https://gerrit.wikimedia.org/r/1215373 (https://phabricator.wikimedia.org/T411833)
[02:55:09] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:55:12] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[02:56:48] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T410589)', diff saved to https://phabricator.wikimedia.org/P86416 and previous config saved to /var/cache/conftool/dbconfig/20251205-025647-ladsgroup.json
[02:56:53] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[02:57:04] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[02:57:12] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1175 (T410589)', diff saved to https://phabricator.wikimedia.org/P86417 and previous config saved to /var/cache/conftool/dbconfig/20251205-025711-ladsgroup.json
[03:09:32] (CR) Dzahn: "can you send me an email to fulfill the requirement for out-of-band verification?"
[puppet] - https://gerrit.wikimedia.org/r/1215373 (https://phabricator.wikimedia.org/T411833) (owner: Papaul)
[03:30:12] FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[03:35:38] (CR) CDanis: [C:+1] trafficserver: add a map for gerrit as a backend [puppet] - https://gerrit.wikimedia.org/r/1215317 (https://phabricator.wikimedia.org/T365259) (owner: Dzahn)
[03:40:46] (PS1) CDanis: WIP [puppet] - https://gerrit.wikimedia.org/r/1215388
[03:40:57] (CR) CDanis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1215388 (owner: CDanis)
[03:45:38] (PS1) CDanis: WIP2 [puppet] - https://gerrit.wikimedia.org/r/1215389
[03:45:47] (CR) CDanis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1215389 (owner: CDanis)
[03:47:05] FIRING: KubernetesCalicoDown: ml-serve1013.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-mlserve&var-instance=ml-serve1013.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[03:50:53] (PS2) CDanis: WIP2 [puppet] - https://gerrit.wikimedia.org/r/1215389
[03:50:57] (CR) CDanis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1215389 (owner: CDanis)
[03:54:12] (PS3) CDanis: WIP2 [puppet] - https://gerrit.wikimedia.org/r/1215389
[03:54:15] (CR) CDanis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1215389 (owner: CDanis)
[03:58:00] (CR) CDanis: "https://puppet-compiler.wmflabs.org/output/1215389/7969/" [puppet] - https://gerrit.wikimedia.org/r/1215389 (owner: CDanis)
[03:59:04] (PS2) CDanis: lvs7003: add gerrit-ssh [puppet] -
https://gerrit.wikimedia.org/r/1215388
[03:59:22] (PS4) CDanis: gerrit-ssh: lvs_setup but only in magru [puppet] - https://gerrit.wikimedia.org/r/1215389
[03:59:31] (PS5) CDanis: gerrit-ssh: lvs_setup but only in magru [puppet] - https://gerrit.wikimedia.org/r/1215389
[04:00:37] (PS1) Pppery: Rename "pt" locale to "pt_PT" so its translations can actually be found [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1215393
[04:01:15] (PS6) CDanis: gerrit-ssh: lvs_setup but only in magru [puppet] - https://gerrit.wikimedia.org/r/1215389
[04:01:17] (CR) CDanis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1215389 (owner: CDanis)
[04:01:39] (CR) Pppery: Rename "pt" locale to "pt_PT" so its translations can actually be found (1 comment) [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1215393 (owner: Pppery)
[04:03:25] (PS7) CDanis: gerrit-ssh: lvs_setup but only in magru [puppet] - https://gerrit.wikimedia.org/r/1215389
[04:03:26] (CR) CDanis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1215389 (owner: CDanis)
[04:06:44] (PS2) Pppery: Rename various locales so their translations can actually be found [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1215393
[04:11:30] (CR) Pppery: "(Sources for these codes being correct: https://github.com/phorgeit/arcanist/blob/master/src/internationalization/locales/PhutilCzechLocal" [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1215393 (owner: Pppery)
[04:12:59] (PS8) CDanis: gerrit services: lvs_setup! but only in magru.
[puppet] - https://gerrit.wikimedia.org/r/1215389
[04:13:11] (CR) CDanis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1215389 (owner: CDanis)
[04:14:56] FIRING: [14x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_gerrit-ssh.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[04:21:08] (PS1) CDanis: lvs7001: add gerrit services [puppet] - https://gerrit.wikimedia.org/r/1215398
[04:21:28] (CR) CDanis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1215398 (owner: CDanis)
[04:52:59] (PS3) Pppery: Rename various locales so their translations can actually be found [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1215393
[05:10:01] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:20:12] FIRING: HelmReleaseBadStatus: Helm release mw-script/utk6lsuw on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[05:35:01] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:59:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state -
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:55:09] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:55:12] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251205T0700)
[07:27:57] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-growthbook-next: apply
[07:28:05] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-growthbook-next: apply
[07:30:12] FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[07:31:39] (PS1) Brouberol: growthbook-next: import secret from the right private value files [deployment-charts] - https://gerrit.wikimedia.org/r/1215450
[07:45:00] (CR) Jelto: [C:+1] "lgtm, do we also host `donate.wikipedia25.org` in miscweb wikikube?"
[deployment-charts] - https://gerrit.wikimedia.org/r/1215225 (https://phabricator.wikimedia.org/T408592) (owner: Dzahn)
[07:47:05] FIRING: KubernetesCalicoDown: ml-serve1013.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-mlserve&var-instance=ml-serve1013.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[07:51:56] ops-eqiad, SRE, DC-Ops: Degraded RAID on ganeti1039 - https://phabricator.wikimedia.org/T410743#11435605 (MoritzMuehlenhoff) Open→Resolved Software RAIDs have been rebuilt
[07:59:28] SRE-Unowned: Update SSH key for kamila - https://phabricator.wikimedia.org/T411404#11435610 (jcrespo) Updating tags, as there is nothing for the broader team/clinic duty to do, please revert when unblocked.
[08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251205T0800)
[08:01:46] (CR) Brouberol: [C:+2] growthbook-next: import secret from the right private value files [deployment-charts] - https://gerrit.wikimedia.org/r/1215450 (owner: Brouberol)
[08:02:14] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-growthbook-next: apply
[08:02:25] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-growthbook-next: apply
[08:02:41] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/ferretdb-growthbook-next: apply
[08:02:59] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/ferretdb-growthbook-next: apply
[08:03:07] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook-next: apply
[08:04:10] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook-next: apply
[08:14:56] FIRING: [14x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_gerrit-ssh.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[08:21:42] SRE-Access-Requests, Patch-For-Review: Add FIDO backed production SSH key for Papaul - https://phabricator.wikimedia.org/T411833#11435684 (Peachey88)
[08:24:02] (CR) Jelto: "two comments in-line. Also as I said in our meeting I'd prefer testing the switchover first on the replica/spare. The cookbook refactoring" [puppet] - https://gerrit.wikimedia.org/r/1211549 (https://phabricator.wikimedia.org/T338470) (owner: Arnaudb)
[08:36:28] (PS3) Muehlenhoff: Add Guillaume as approver for two more analytics groups [puppet] - https://gerrit.wikimedia.org/r/1212168 (https://phabricator.wikimedia.org/T276465)
[08:38:18] (CR) Muehlenhoff: [C:+2] Add Guillaume as approver for two more analytics groups [puppet] - https://gerrit.wikimedia.org/r/1212168 (https://phabricator.wikimedia.org/T276465) (owner: Muehlenhoff)
[08:39:11] SRE, Infrastructure Security, Infrastructure-Foundations, Patch-For-Review: puppet admin module: Assign approvers to unix groups - https://phabricator.wikimedia.org/T276465#11435726 (MoritzMuehlenhoff)
[08:51:05] ops-eqiad, DC-Ops: Wrong disk order on ml-lab1001?
- https://phabricator.wikimedia.org/T411753#11435749 (Jclark-ctr)
[09:08:39] (PS3) Federico Ceratto: clone.py: Upsert instance data in Zarcillo [cookbooks] - https://gerrit.wikimedia.org/r/1214083 (https://phabricator.wikimedia.org/T410084)
[09:16:32] !log fceratto@cumin1003 START - Cookbook sre.mysql.clone of db1233.eqiad.wmnet onto db1229.eqiad.wmnet
[09:16:36] !log fceratto@cumin1003 START - Cookbook sre.mysql.depool db1233 - Depool db1233.eqiad.wmnet to then clone it to db1229.eqiad.wmnet - fceratto@cumin1003
[09:16:54] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db1233 - Depool db1233.eqiad.wmnet to then clone it to db1229.eqiad.wmnet - fceratto@cumin1003
[09:20:12] FIRING: HelmReleaseBadStatus: Helm release mw-script/utk6lsuw on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[09:58:40] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply
[09:59:17] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply
[09:59:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:02:26] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply
[10:02:50] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply
[10:07:09] (PS1) Muehlenhoff: Stop using puppetmaster2002 for Blackbox smoke tests [puppet] - https://gerrit.wikimedia.org/r/1215548 (https://phabricator.wikimedia.org/T365798)
[10:12:06] (PS1) Muehlenhoff: Remove puppetmaster2002 [puppet] - https://gerrit.wikimedia.org/r/1215549 (https://phabricator.wikimedia.org/T365798)
[10:15:30] (CR) Elukey: [C:+1] Stop using puppetmaster2002 for Blackbox smoke tests [puppet] - https://gerrit.wikimedia.org/r/1215548 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[10:16:31] (PS2) A smart kitten: SVG: do not allow native SVG rendering [mediawiki-config] - https://gerrit.wikimedia.org/r/1192528 (https://phabricator.wikimedia.org/T406023) (owner: TheDJ)
[10:17:00] RECOVERY - Host an-worker1148 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[10:17:06] PROBLEM - SSH on an-worker1148 is CRITICAL: connect to address 10.64.142.2 and port 22: Connection refused https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:17:34] (CR) A smart kitten: "PS2 is a manual rebase to resolve merge conflicts" [mediawiki-config] - https://gerrit.wikimedia.org/r/1192528 (https://phabricator.wikimedia.org/T406023) (owner: TheDJ)
[10:28:26] PROBLEM - Host an-worker1148 is DOWN: PING CRITICAL - Packet loss = 100%
[10:29:51] (PS1) Btullis: Bump spark image [deployment-charts] - https://gerrit.wikimedia.org/r/1215556 (https://phabricator.wikimedia.org/T410017)
[10:41:06] RECOVERY - SSH on an-worker1148 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:41:09] RECOVERY - Host an-worker1148 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms
[10:41:19] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool db1233 gradually with 4 steps - Pool db1233.eqiad.wmnet in after cloning
[10:45:41] (CR) Majavah: [V:+1 C:+2] P:kubernetes: deployment_server: Use wmflib::ip2cidr [puppet] - https://gerrit.wikimedia.org/r/1214509 (owner: Majavah)
[10:49:00] RECOVERY - MegaRAID on an-worker1148 is OK: OK: optimal, 12 logical, 13 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[10:54:23] (CR) Jelto: "thank you for preparing the patch, it looks good to me however PCC shows a different healthcheck configuration which might return a 302 in" [puppet] - https://gerrit.wikimedia.org/r/1215389 (owner: CDanis)
[10:55:09] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:55:12] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[11:04:25] (PS1) Cathal Mooney: lvs1018: Remove vlan sub-interfaces [puppet] - https://gerrit.wikimedia.org/r/1215565 (https://phabricator.wikimedia.org/T411781)
[11:26:48] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1233 gradually with 4 steps - Pool db1233.eqiad.wmnet in after cloning
[11:28:58] does anyone know if we need to do something to fix a cronjob that failed due to the service mesh being unavailable? T411862
[11:28:59] T411862: MediaWiki periodic job wikidata-resubmit-changes-for-dispatch failed - https://phabricator.wikimedia.org/T411862
[11:30:12] FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[11:31:14] do we need to manually delete the failed job so that it’ll resume running periodically?
[11:31:30] * Lucas_WMDE may try that later but leaves some time for someone else to chime in first
[11:39:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[11:40:55] (PS2) Muehlenhoff: Only select Puppet version based on the Debian release [puppet] - https://gerrit.wikimedia.org/r/1214564 (https://phabricator.wikimedia.org/T365798)
[11:42:03] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1233.eqiad.wmnet onto db1229.eqiad.wmnet
[11:42:16] kubectl delete job wikidata-resubmit-changes-for-dispatch-29415459 # T411862
[11:42:17] T411862: MediaWiki periodic job wikidata-resubmit-changes-for-dispatch failed - https://phabricator.wikimedia.org/T411862
[11:42:33] oops, that doesn’t include the user name or host name by default? TIL
[11:42:43] wait, it doesn’t even include !log by default?
[11:42:55] !log lucaswerkmeister-wmde@deploy2002 kubectl delete job wikidata-resubmit-changes-for-dispatch-29415459 # T411862
[11:43:08] ok that seems to have worked better
[11:43:18] --help
[11:43:22] yeah I figured
[11:43:49] so dologmsg is just a straight pipe to IRC with no bells and whistles.
good to know
[11:44:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[11:45:54] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1214564 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[11:46:01] Lucas_WMDE, per https://phabricator.wikimedia.org/search/query/tXtQf80._Cf8/#R, we seem to have a handful of them recently. I was digging into this recently too and there is a useful conversation at T410764.
[11:46:01] T410764: MediaWiki periodic job startupregistrystats-mediawikiwiki failed - https://phabricator.wikimedia.org/T410764
[11:46:20] It seems like the service affected just needs a restart but I'm no expert in that area
[11:46:50] xSavitar: thanks, good to know
[11:46:52] I think SRE/ServiceOps may have some ideas.
[11:47:05] FIRING: KubernetesCalicoDown: ml-serve1013.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-mlserve&var-instance=ml-serve1013.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[11:47:07] I don’t know why phaultfinder didn’t create a task for the alert I encountered (AFAICT)
[11:47:12] Ack! But I agree with you, it's happening quite a bit more frequently than normal recently.
[11:51:51] (PS3) Muehlenhoff: Only select Puppet version based on the Debian release [puppet] - https://gerrit.wikimedia.org/r/1214564 (https://phabricator.wikimedia.org/T365798)
[11:52:45] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply
[11:53:05] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply
[11:54:58] Lucas_WMDE, I don't know why an alert was not created for that either. Maybe something about wikidata needs to be added to https://gerrit.wikimedia.org/r/plugins/gitiles/operations/alerts/%2B/refs/heads/master
[11:55:14] I can see various team sub-dirs there
[11:55:37] (PS1) Btullis: Bump the mediawiki image used in the mediawiki-dumps-legacy-toolbox [deployment-charts] - https://gerrit.wikimedia.org/r/1215576 (https://phabricator.wikimedia.org/T405955)
[11:55:47] https://gerrit.wikimedia.org/g/operations/alerts/+/dd2e90219cec0e4605ba0025bd496e01981f4603/team-sre/mw-cron.yaml
[11:56:06] hm, no idea
[11:56:12] I mean, the alert itself existed, I could see it on alerts.w.o
[11:56:18] it just didn’t make a phab task
[11:56:38] right, that’s probably where it came from, mw-cron.yaml
[11:57:53] (CR) Zoe: [C:+1] "Yup, new key is working!" [puppet] - https://gerrit.wikimedia.org/r/1215273 (https://phabricator.wikimedia.org/T411506) (owner: Andrea Denisse)
[12:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251205T0800)
[12:00:05] jelto, arnoldokoth, mutante, and arnaudb: GitLab version upgrades (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251205T1200). Please do the needful.
[12:00:57] Lucas_WMDE, maybe you want to file a task
[12:05:34] * Lucas_WMDE tried to improve some docs at https://wikitech.wikimedia.org/wiki/dologmsg
[12:05:52] xSavitar: eh, I’m okay with leaving it alone for now.
maybe if it happens again [12:05:55] but thanks! [12:08:02] (also, I couldn’t find a general wikitech documentation page for !log usage in production… but maybe I missed it) [12:10:02] interestingly enough, the documentation suggests that at some point, `dologmsg` did add the !log prefix somewhere in the pipeline: https://gerrit.wikimedia.org/g/operations/puppet/+/63a8174ffd/modules/scap/files/manpages/asciidoc/dologmsg.txt#37 [12:10:13] that, or the documentation was always misleading ^^ [12:14:56] FIRING: [14x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_gerrit-ssh.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [12:17:48] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 07Documentation, 07Puppet (Puppet 7.0): Puppet7: Update documentation - https://phabricator.wikimedia.org/T341095#11436234 (10jcrespo) I would like to mention in particular workflows like renewal/revoking of certificates on server workflows, pa...
[12:30:01] RESOLVED: HelmReleaseBadStatus: Helm release mw-script/utk6lsuw on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [12:30:21] !log removed helm release mw-script/utk6lsuw in k8s@codfw which was stuck in pending-install state for 9+ days [12:30:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:25] !log upgrade python3-sshpubkeys on idm-test1001 to 3.3.1-1~wmf12u1 T411816 [12:42:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:29] T411816: cannot add a FIDO-backed ssh key to Bitu - https://phabricator.wikimedia.org/T411816 [12:53:06] 06SRE, 06Infrastructure-Foundations, 10netops: rancid: message has lines too long for transport - https://phabricator.wikimedia.org/T410606#11436260 (10cmooney) 05Resolved→03Open Thanks for the work on this @MoritzMuehlenhoff! From what I can see we still have a small number of these mails coming throug...
[13:04:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:05:28] (03CR) 10Federico Ceratto: "Tested in T411805" [cookbooks] - 10https://gerrit.wikimedia.org/r/1215116 (https://phabricator.wikimedia.org/T391581) (owner: 10Federico Ceratto) [13:10:54] !log upload python3-sshpubkeys to 3.3.1-1~wmf12u1 to apt.wikimedia.org T411816 [13:10:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:57] T411816: cannot add a FIDO-backed ssh key to Bitu - https://phabricator.wikimedia.org/T411816 [13:22:28] (03CR) 10Jforrester: [C:03+1] Bump the mediawiki image used in the mediawiki-dumps-legacy-toolbox [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215576 (https://phabricator.wikimedia.org/T405955) (owner: 10Btullis) [13:22:34] (03CR) 10Elukey: Only select Puppet version based on the Debian release (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1214564 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [13:25:17] (03CR) 10Muehlenhoff: Only select Puppet version based on the Debian release (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1214564 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [13:26:43] (03CR) 10Btullis: [C:03+2] Bump the mediawiki image used in the mediawiki-dumps-legacy-toolbox [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215576 (https://phabricator.wikimedia.org/T405955) (owner: 10Btullis) [13:26:45] 06SRE, 06Infrastructure-Foundations, 10Puppet CI, 10Puppet-Infrastructure, 13Patch-For-Review: Default to the Puppet 7 PCC CI test, make it voting and eventually remove the Puppet 5 one - https://phabricator.wikimedia.org/T367399#11436294 (10taavi) [13:28:23] (03Merged) 10jenkins-bot: Bump the mediawiki image used in the 
mediawiki-dumps-legacy-toolbox [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215576 (https://phabricator.wikimedia.org/T405955) (owner: 10Btullis) [13:30:33] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [13:31:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr3-ulsfo:xe-0/1/1 (Transport: cr2-eqord:xe-0/1/3 (Arelion, IC-313592 51ms 10Gbps wave) {#1062}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [13:32:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at codfw: 20.41% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:33:00] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [13:33:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:36:40] (03PS1) 10Federico Ceratto: db1229: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1215599 (https://phabricator.wikimedia.org/T411652) [13:36:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [13:38:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:41:39] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-growthbook: apply [13:42:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at codfw: 14.56% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:43:12] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [13:43:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:43:35] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [13:46:29] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-growthbook-next: apply [13:46:30] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-growthbook-next: apply [13:46:40] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-growthbook-next: apply [13:47:01] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-growthbook-next: apply [13:49:36] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START 
helmfile.d/dse-k8s-services/ferretdb-growthbook-next: apply [13:49:47] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/ferretdb-growthbook-next: apply [13:49:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:50:25] now that is annoying... [13:50:44] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook-next: apply [13:51:23] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthboo-next: apply [13:52:29] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook-next: apply [13:52:54] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthboo-next: apply [14:02:19] (03PS1) 10Daniel Kinzler: rest gateway: add smoke tests [WIP] [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215605 [14:02:28] (03CR) 10CI reject: [V:04-1] rest gateway: add smoke tests [WIP] [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215605 (owner: 10Daniel Kinzler) [14:03:32] (03CR) 10Jelto: [C:03+2] vrts: add high inode usage alert [alerts] - 10https://gerrit.wikimedia.org/r/1214034 (https://phabricator.wikimedia.org/T411452) (owner: 10AOkoth) [14:04:07] (03CR) 10Elukey: services: add maps-next.w.o as FQDN for kartotherian staging (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215098 (https://phabricator.wikimedia.org/T409528) (owner: 10Elukey) [14:05:06] (03Merged) 10jenkins-bot: vrts: add high inode usage alert [alerts] - 10https://gerrit.wikimedia.org/r/1214034 (https://phabricator.wikimedia.org/T411452) (owner: 10AOkoth) [14:08:39] !log stopped puppet on wikikube-ctrl2* and 
restarted kube-apiserver to temporarily extend audit logging [14:08:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:49] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [14:12:16] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [14:16:25] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:21:51] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T410589)', diff saved to https://phabricator.wikimedia.org/P86425 and previous config saved to /var/cache/conftool/dbconfig/20251205-142150-ladsgroup.json [14:21:54] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [14:22:54] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-growthbook: apply [14:23:04] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-growthbook: apply [14:25:36] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [14:25:52] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [14:26:27] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [14:27:30] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [14:33:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:36:59] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P86426 and previous config saved to /var/cache/conftool/dbconfig/20251205-143658-ladsgroup.json [14:39:42] (03PS2) 10Btullis: Bump spark image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215556 (https://phabricator.wikimedia.org/T410017) [14:41:45] (03CR) 10Muehlenhoff: [C:03+1] "Looks good and verified out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1215373 (https://phabricator.wikimedia.org/T411833) (owner: 10Papaul) [14:42:36] (03CR) 10Btullis: [C:03+2] Bump spark image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215556 (https://phabricator.wikimedia.org/T410017) (owner: 10Btullis) [14:44:05] 06SRE, 06collaboration-services, 10vrts, 10Znuny, 07Wikimedia-Incident: No space left on device on VRTS host - https://phabricator.wikimedia.org/T411452#11436505 (10Jelto) 05Open→03Resolved a:03Arnoldokoth Thanks @Arnoldokoth for enabling the cleanup job again. The [inode metrics](https://grafa... 
[14:44:39] (03Merged) 10jenkins-bot: Bump spark image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215556 (https://phabricator.wikimedia.org/T410017) (owner: 10Btullis) [14:45:57] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/analytics-test: apply [14:46:05] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/analytics-test: apply [14:50:40] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook-next: apply [14:51:02] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthboo-next: apply [14:51:22] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:51:55] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [14:52:06] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P86427 and previous config saved to /var/cache/conftool/dbconfig/20251205-145206-ladsgroup.json [14:52:20] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [14:55:12] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:55:30] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook-next: apply [14:55:30] (03CR) 10Elukey: [C:03+1] Only select Puppet version based on the Debian release [puppet] - 10https://gerrit.wikimedia.org/r/1214564 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [14:56:18] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthboo-next: apply 
[14:57:01] (03CR) 10Ladsgroup: [C:03+1] "Icinga is green" [puppet] - 10https://gerrit.wikimedia.org/r/1215599 (https://phabricator.wikimedia.org/T411652) (owner: 10Federico Ceratto) [15:02:12] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:02:12] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:07:14] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T410589)', diff saved to https://phabricator.wikimedia.org/P86428 and previous config saved to /var/cache/conftool/dbconfig/20251205-150713-ladsgroup.json [15:07:17] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [15:07:30] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1189.eqiad.wmnet with reason: Maintenance [15:07:39] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1189 (T410589)', diff saved to https://phabricator.wikimedia.org/P86429 and previous config saved to /var/cache/conftool/dbconfig/20251205-150737-ladsgroup.json [15:08:12] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 03 Feb 2026 07:30:03 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:10:01] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:11:11] (03PS1) 10Superpes15: [enwikibooks] Allow sysops to revert abusefilter and grant/revoke confirmed and accountcreator flags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1215625 (https://phabricator.wikimedia.org/T411828) [15:12:02] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 55267 bytes in 0.077 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:12:02] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.182 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:27:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at codfw: 16.2% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:29:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:30:12] FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [15:30:55] !log creating ores tables on thwiki (T409438) [15:30:58] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:59] T409438: Enable revertrisk filters in thwiki - https://phabricator.wikimedia.org/T409438 [15:32:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at codfw: 24.81% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:34:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:34:23] (03PS1) 10Btullis: Remove incorrect hive.server2 settings and correct the k8s URL [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215627 (https://phabricator.wikimedia.org/T406833) [15:35:01] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:36:46] (03CR) 10Federico Ceratto: [C:03+2] db1229: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1215599 (https://phabricator.wikimedia.org/T411652) (owner: 10Federico Ceratto) [15:38:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:42:02] (03CR) 10Btullis: [C:03+2] Remove incorrect hive.server2 settings and correct the k8s URL [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1215627 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis) [15:44:08] (03Merged) 10jenkins-bot: Remove incorrect hive.server2 settings and correct the k8s URL [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215627 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis) [15:47:05] FIRING: KubernetesCalicoDown: ml-serve1013.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-mlserve&var-instance=ml-serve1013.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:50:29] (03CR) 10Papaul: [C:03+2] "Email sent" [puppet] - 10https://gerrit.wikimedia.org/r/1215373 (https://phabricator.wikimedia.org/T411833) (owner: 10Papaul) [15:58:23] (03Abandoned) 10Hashar: gerrit: add a layer of CNAME to ease switch overs [dns] - 10https://gerrit.wikimedia.org/r/1210560 (https://phabricator.wikimedia.org/T387833) (owner: 10Hashar) [16:02:16] 06SRE, 06collaboration-services, 13Patch-For-Review, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11436743 (10Dzahn) [16:03:18] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/analytics-test: apply [16:03:25] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/analytics-test: apply [16:06:15] (03CR) 10Dzahn: [C:03+1] "got the email but see it's already resolved:)" [puppet] - 10https://gerrit.wikimedia.org/r/1215373 (https://phabricator.wikimedia.org/T411833) (owner: 10Papaul) [16:11:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:13:40] 
(03CR) 10JHathaway: [C:03+1] Only select Puppet version based on the Debian release [puppet] - 10https://gerrit.wikimedia.org/r/1214564 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [16:13:48] 06SRE, 06collaboration-services, 13Patch-For-Review, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11436784 (10Dzahn) >>! In T408592#11386273, @ATitkov wrote: > Forgot to add the current repo [[ https://gitlab.wikimedia.org/toolforge-repos/... [16:14:56] FIRING: [14x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_gerrit-ssh.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [16:17:47] (03CR) 10Dzahn: "per discussion today: at least initially we are not going to move donate.wikipedia25.org - it will stay in the ncredir cluster with the ex" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215225 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [16:18:34] (03CR) 10Dzahn: [C:03+1] "if you feel like deploying this, go ahead. if not I will get back to it Tuesday. or we can do it in the Tuesday meeting." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1215225 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [16:19:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:22:13] (03CR) 10Dzahn: [C:03+2] trafficserver: add a map for gerrit as a backend [puppet] - 10https://gerrit.wikimedia.org/r/1215317 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn) [16:25:40] (03CR) 10Dzahn: [C:03+1] gerrit services: lvs_setup! but only in magru. [puppet] - 10https://gerrit.wikimedia.org/r/1215389 (owner: 10CDanis) [16:39:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:48:59] 10ops-eqiad, 06SRE, 06DC-Ops: Wrong disk order on ml-lab1001? - https://phabricator.wikimedia.org/T411753#11436870 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Removed 2x drives 4:0:0:0] disk ATA Micron_5400_MTFD U002 /dev/sda [5:0:0:0] disk ATA Micron_5400_MTFD U002 /dev/sdc [16:50:33] (03PS3) 10CDanis: lvs7003: add gerrit-ssh [puppet] - 10https://gerrit.wikimedia.org/r/1215388 [16:50:33] (03PS9) 10CDanis: gerrit services: lvs_setup! but only in magru. 
[puppet] - 10https://gerrit.wikimedia.org/r/1215389 [16:50:33] (03PS2) 10CDanis: lvs7001: add gerrit services [puppet] - 10https://gerrit.wikimedia.org/r/1215398 [16:50:48] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1215389 (owner: 10CDanis) [16:50:51] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1215398 (owner: 10CDanis) [16:51:11] (03CR) 10CDanis: "Yes, you are very right, and apologies for forgetting this after the first time you pointed it out." [puppet] - 10https://gerrit.wikimedia.org/r/1215389 (owner: 10CDanis) [16:52:05] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-lab1001.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [16:56:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:01:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:02:20] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-lab1001.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [17:02:42] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-lab1001.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [17:07:07] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.provision (exit_code=97) for host ml-lab1001.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [17:10:03] !log elukey@cumin1003 START - Cookbook 
sre.hosts.provision for host ml-lab1001.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [17:10:16] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-lab1001.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [17:10:49] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-lab1001.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [17:11:02] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-lab1001.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [17:12:23] (03CR) 10Majavah: [C:03+1] P:toolforge:prometheus: scrape mariadb metrics [puppet] - 10https://gerrit.wikimedia.org/r/1215121 (https://phabricator.wikimedia.org/T410505) (owner: 10FNegri) [17:14:23] (03CR) 10Elukey: ml-build: define new machine name/type (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1213972 (https://phabricator.wikimedia.org/T394778) (owner: 10Dpogorzelski) [17:18:18] (03CR) 10Dzahn: "yea, in our blackbox monitoring checks we also accept "200 OR 302"" [puppet] - 10https://gerrit.wikimedia.org/r/1215389 (owner: 10CDanis) [17:18:22] (03CR) 10Dzahn: [C:03+1] gerrit services: lvs_setup! but only in magru. [puppet] - 10https://gerrit.wikimedia.org/r/1215389 (owner: 10CDanis) [17:23:35] !log add updated ssh firewall filter config to pfw1-eqiad.wikimedia.org T390939 [17:23:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:18] 10ops-eqiad, 06SRE, 06DC-Ops: Inbound errors on interface ssw1-e1-eqiad:xe-0/0/32 (Transport: lvs1020:enp94s0f0np0 (Equinix, 21996479) {#21989994}) - https://phabricator.wikimedia.org/T411818#11436967 (10Jclark-ctr) @cmooney I might have closed out T411684 prematurely. I had noticed the spike occurred at the... 
[17:27:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at codfw: 9.539% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:28:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [17:29:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:32:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at codfw: 23.37% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:33:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [17:38:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [17:40:19] 10ops-eqiad, 06SRE, 06DC-Ops: Inbound errors on interface ssw1-e1-eqiad:xe-0/0/32 (Transport: lvs1020:enp94s0f0np0 (Equinix, 21996479) {#21989994}) - https://phabricator.wikimedia.org/T411818#11437045 (10Jclark-ctr) a:03Jclark-ctr [17:42:12] 06SRE, 10SRE-Access-Requests: Requesting access to for - https://phabricator.wikimedia.org/T411883 (10Leif_WMDE) 03NEW [17:42:56] 06SRE, 10SRE-Access-Requests: Requesting access to for - https://phabricator.wikimedia.org/T411883#11437057 (10Leif_WMDE) a:03Lena_WMDE [17:44:45] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [17:45:00] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [17:49:45] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [17:51:59] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215663 [18:02:10] (03PS1) 10Btullis: Add an extra egress rule for airflow-main to allow uploading to frack s3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215665 (https://phabricator.wikimedia.org/T411740) [18:06:18] (03Abandoned) 10Jforrester: Wikifunctions SLO: Adjust upper bucket to 10.1s to cover slow reporting [puppet] - 10https://gerrit.wikimedia.org/r/1192609 (https://phabricator.wikimedia.org/T394057) (owner: 10Jforrester) [18:06:52] (03CR) 10Btullis: [C:03+2] Add an extra egress rule for airflow-main to allow uploading to frack s3 [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1215665 (https://phabricator.wikimedia.org/T411740) (owner: 10Btullis) [18:08:55] (03Merged) 10jenkins-bot: Add an extra egress rule for airflow-main to allow uploading to frack s3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215665 (https://phabricator.wikimedia.org/T411740) (owner: 10Btullis) [18:10:34] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [18:10:52] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [18:16:40] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:17:58] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook-next: apply [18:18:18] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook-next: apply [18:21:20] 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users and SQL Lab for Leif WMDE - https://phabricator.wikimedia.org/T411883#11437166 (10Pppery) [18:21:47] (03PS1) 10Btullis: Add a certificate for the frtech root CA to airflow-main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215669 (https://phabricator.wikimedia.org/T411740) [18:22:55] (03CR) 10Aleksandar Mastilovic: [C:03+1] "LGTM!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1215665 (https://phabricator.wikimedia.org/T411740) (owner: 10Btullis) [18:27:20] (03PS2) 10Btullis: Add a certificate for the frtech root CA to airflow-main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215669 (https://phabricator.wikimedia.org/T411740) [18:27:59] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply [18:28:31] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply [18:34:06] (03PS1) 10Jelto: gerrit: allow https traffic to both interfaces [puppet] - 10https://gerrit.wikimedia.org/r/1215673 (https://phabricator.wikimedia.org/T365259) [18:36:22] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7797/co" [puppet] - 10https://gerrit.wikimedia.org/r/1215673 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto) [18:36:37] (03CR) 10Dzahn: "oh yes, absolutely. good point/catch!" [puppet] - 10https://gerrit.wikimedia.org/r/1215673 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto) [18:38:44] (03CR) 10CDanis: [C:03+1] "thanks!!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1215673 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto) [18:40:44] (03PS3) 10Btullis: Add a certificate and an S3 connection to airflow-main for frtech [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215669 (https://phabricator.wikimedia.org/T411740) [18:48:35] (03PS4) 10Btullis: Add a certificate and an S3 connection to airflow-main for frtech [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215669 (https://phabricator.wikimedia.org/T411740) [18:51:34] FIRING: DiskSpace: Disk space serpens:9100:/ 3.153% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=serpens - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [18:52:08] (03CR) 10Btullis: [C:03+2] Add a certificate and an S3 connection to airflow-main for frtech [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215669 (https://phabricator.wikimedia.org/T411740) (owner: 10Btullis) [18:54:10] (03Merged) 10jenkins-bot: Add a certificate and an S3 connection to airflow-main for frtech [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215669 (https://phabricator.wikimedia.org/T411740) (owner: 10Btullis) [18:55:12] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:56:25] (03CR) 10Dzahn: [V:03+1 C:03+1] "https://puppet-compiler.wmflabs.org/output/1215673/7798/" [puppet] - 10https://gerrit.wikimedia.org/r/1215673 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto) [18:56:59] (03CR) 10CDanis: [C:03+2] gerrit: allow https traffic to both interfaces [puppet] - 10https://gerrit.wikimedia.org/r/1215673 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto) [19:09:36] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START 
helmfile.d/dse-k8s-services/airflow-main: apply [19:10:14] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [19:22:36] (03PS2) 10Dzahn: admin: Remove unused SSH key for Zoe [puppet] - 10https://gerrit.wikimedia.org/r/1215273 (https://phabricator.wikimedia.org/T411506) (owner: 10Andrea Denisse) [19:22:38] (03CR) 10Dzahn: [C:03+2] admin: Remove unused SSH key for Zoe [puppet] - 10https://gerrit.wikimedia.org/r/1215273 (https://phabricator.wikimedia.org/T411506) (owner: 10Andrea Denisse) [19:30:12] FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [19:31:05] (03CR) 10Dzahn: [C:03+2] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1215273 (https://phabricator.wikimedia.org/T411506) (owner: 10Andrea Denisse) [19:33:28] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting update of SSH key for zoe - https://phabricator.wikimedia.org/T411506#11437363 (10Dzahn) I think this is now resolved. [19:34:32] 06SRE, 10SRE-Access-Requests: Add FIDO backed production SSH key for Papaul - https://phabricator.wikimedia.org/T411833#11437365 (10Dzahn) This seems to be resolved now. [19:38:35] 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users and SQL Lab for Leif WMDE - https://phabricator.wikimedia.org/T411883#11437372 (10Dzahn) Hello @Leif_WMDE you can kick-off the process early by sending an email to [[ https://meta.wikimedia.org/wiki/User:KFrancis_(WMF) | Katie Fra... 
[19:39:05] (03PS1) 10CDanis: ats: gerrit: don't validate TLS host for now [puppet] - 10https://gerrit.wikimedia.org/r/1215684 [19:39:27] (03PS2) 10CDanis: ats: gerrit: don't validate TLS host for now [puppet] - 10https://gerrit.wikimedia.org/r/1215684 [19:39:41] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1215684 (owner: 10CDanis) [19:41:34] RESOLVED: DiskSpace: Disk space serpens:9100:/ 0.5635% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=serpens - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [19:43:50] (03CR) 10Dzahn: [C:03+1] "not claiming to review the actual ATS config but the problem is clear and a blocker and since this only touches the gerrit map that has ju" [puppet] - 10https://gerrit.wikimedia.org/r/1215684 (owner: 10CDanis) [19:44:47] (03PS1) 10Btullis: Revert "Add a certificate and an S3 connection to airflow-main for frtech" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215686 [19:45:41] (03CR) 10Aleksandar Mastilovic: [C:03+1] Revert "Add a certificate and an S3 connection to airflow-main for frtech" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215686 (owner: 10Btullis) [19:46:12] (03CR) 10Btullis: [C:03+2] Revert "Add a certificate and an S3 connection to airflow-main for frtech" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215686 (owner: 10Btullis) [19:46:20] (03CR) 10Btullis: [V:03+2 C:03+2] Revert "Add a certificate and an S3 connection to airflow-main for frtech" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215686 (owner: 10Btullis) [19:47:05] FIRING: KubernetesCalicoDown: ml-serve1013.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-mlserve&var-instance=ml-serve1013.eqiad.wmnet - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [19:48:22] (03Merged) 10jenkins-bot: Revert "Add a certificate and an S3 connection to airflow-main for frtech" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215686 (owner: 10Btullis) [19:49:33] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [19:50:24] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [20:10:25] (03PS1) 10Scott French: shellbox-constraints: bump replicas to 20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215672 [20:10:28] (03CR) 10Scott French: [C:03+2] shellbox-constraints: bump replicas to 20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215672 (owner: 10Scott French) [20:12:42] (03Merged) 10jenkins-bot: shellbox-constraints: bump replicas to 20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215672 (owner: 10Scott French) [20:14:56] FIRING: [14x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_gerrit-ssh.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [20:16:43] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply [20:17:11] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply [20:18:13] (03PS3) 10CDanis: ats: gerrit: don't validate TLS host for now [puppet] - 10https://gerrit.wikimedia.org/r/1215684 [20:18:14] (03PS4) 10CDanis: lvs7003: add gerrit-ssh [puppet] - 10https://gerrit.wikimedia.org/r/1215388 [20:18:14] (03PS10) 10CDanis: gerrit services: lvs_setup! but only in magru. 
[puppet] - 10https://gerrit.wikimedia.org/r/1215389 [20:18:14] (03PS3) 10CDanis: lvs7001: add gerrit services [puppet] - 10https://gerrit.wikimedia.org/r/1215398 [20:21:41] (03PS4) 10CDanis: ats: gerrit: don't validate TLS host for now [puppet] - 10https://gerrit.wikimedia.org/r/1215684 (https://phabricator.wikimedia.org/T411895) [20:21:43] (03PS5) 10CDanis: lvs7003: add gerrit-ssh [puppet] - 10https://gerrit.wikimedia.org/r/1215388 (https://phabricator.wikimedia.org/T411895) [20:21:45] (03PS11) 10CDanis: gerrit services: lvs_setup! but only in magru. [puppet] - 10https://gerrit.wikimedia.org/r/1215389 (https://phabricator.wikimedia.org/T411895) [20:21:47] (03PS4) 10CDanis: lvs7001: add gerrit services [puppet] - 10https://gerrit.wikimedia.org/r/1215398 (https://phabricator.wikimedia.org/T411895) [20:21:49] (03PS1) 10CDanis: gerrit/Liberica: expand to drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1215693 (https://phabricator.wikimedia.org/T411895) [20:38:47] (03PS1) 10Dzahn: add gerrit-ssh and gerrit-https to liberica services on lvs7003 [puppet] - 10https://gerrit.wikimedia.org/r/1215699 (https://phabricator.wikimedia.org/T411895) [20:40:37] (03PS6) 10CDanis: lvs7003: add gerrit-ssh and gerrit-https [puppet] - 10https://gerrit.wikimedia.org/r/1215388 (https://phabricator.wikimedia.org/T411895) [20:40:37] (03PS12) 10CDanis: gerrit services: lvs_setup! but only in magru. 
[puppet] - 10https://gerrit.wikimedia.org/r/1215389 (https://phabricator.wikimedia.org/T411895) [20:40:37] (03PS5) 10CDanis: lvs7001: add gerrit services [puppet] - 10https://gerrit.wikimedia.org/r/1215398 (https://phabricator.wikimedia.org/T411895) [20:40:38] (03PS2) 10CDanis: gerrit/Liberica: expand to drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1215693 (https://phabricator.wikimedia.org/T411895) [20:40:55] (03Abandoned) 10Dzahn: add gerrit-ssh and gerrit-https to liberica services on lvs7003 [puppet] - 10https://gerrit.wikimedia.org/r/1215699 (https://phabricator.wikimedia.org/T411895) (owner: 10Dzahn) [21:03:05] !log tchin@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply [21:03:19] !log tchin@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply [21:04:57] (03CR) 10Dzahn: "in the future this should be reverted for https://phabricator.wikimedia.org/T411904" [puppet] - 10https://gerrit.wikimedia.org/r/1215684 (https://phabricator.wikimedia.org/T411895) (owner: 10CDanis) [21:14:00] (03PS1) 10Dzahn: switch gerrit service IP to CDN [dns] - 10https://gerrit.wikimedia.org/r/1215709 (https://phabricator.wikimedia.org/T411895) [21:16:24] (03CR) 10Dzahn: "not quite yet but not far away" [dns] - 10https://gerrit.wikimedia.org/r/1215709 (https://phabricator.wikimedia.org/T411895) (owner: 10Dzahn) [21:18:55] !log xcollazo@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [21:19:26] !log xcollazo@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [21:29:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:38:33] !log 
bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [21:48:59] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [21:50:55] (03CR) 10BCornwall: switch gerrit service IP to CDN (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1215709 (https://phabricator.wikimedia.org/T411895) (owner: 10Dzahn) [21:52:07] (03CR) 10Dzahn: switch gerrit service IP to CDN (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1215709 (https://phabricator.wikimedia.org/T411895) (owner: 10Dzahn) [21:52:32] (03PS2) 10Dzahn: switch gerrit service IP to CDN [dns] - 10https://gerrit.wikimedia.org/r/1215709 (https://phabricator.wikimedia.org/T411895) [21:56:37] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [21:57:48] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [22:06:39] (03PS1) 10Aleksandar Mastilovic: Revert "Add an extra egress rule for airflow-main to allow uploading to frack s3" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215717 [22:07:07] (03CR) 10Xcollazo: [C:03+1] Revert "Add an extra egress rule for airflow-main to allow uploading to frack s3" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215717 (owner: 10Aleksandar Mastilovic) [22:07:49] (03CR) 10Aleksandar Mastilovic: [C:03+1] Revert "Add an extra egress rule for airflow-main to allow uploading to frack s3" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215717 (owner: 10Aleksandar Mastilovic) [22:09:09] (03CR) 10Btullis: [C:03+2] Revert "Add an extra egress rule for airflow-main to allow uploading to frack s3" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215717 (owner: 10Aleksandar Mastilovic) [22:09:10] (03CR) 10Bking: [V:03+2 C:03+2] Add an extra egress rule for airflow-main to allow uploading to frack s3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215665 
(https://phabricator.wikimedia.org/T411740) (owner: 10Btullis) [22:10:04] (03CR) 10Bking: [V:03+2 C:03+2] Revert "Add an extra egress rule for airflow-main to allow uploading to frack s3" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215717 (owner: 10Aleksandar Mastilovic) [22:10:56] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [22:11:03] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [22:11:41] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [22:11:59] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [22:16:40] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:29:21] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [22:29:27] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [22:31:57] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [22:32:03] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [22:33:43] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [22:34:00] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [22:34:22] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [22:35:06] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [22:55:12] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is 
about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [23:30:12] FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [23:47:05] FIRING: KubernetesCalicoDown: ml-serve1013.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-mlserve&var-instance=ml-serve1013.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown