[00:00:45] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] "[cumin2002:~] $ sudo cumin 'tcp-*' "ip addr show dev lo | grep global"" [puppet] - 10https://gerrit.wikimedia.org/r/1215240 (owner: 10CDanis)
[00:01:48] <icinga-wm>	 RECOVERY - MD RAID on ganeti1039 is OK: OK: Active: 12, Working: 12, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[00:04:04] <wikibugs>	 06SRE, 06collaboration-services, 06Traffic, 06Release-Engineering-Team (Radar): Deploy a TCP proxy across all DCs - https://phabricator.wikimedia.org/T408532#11435137 (10Dzahn) This change should have been linked here.  https://gerrit.wikimedia.org/r/c/operations/puppet/+/1215240 (thanks cdanis!)  It added...
[00:05:31] <wikibugs>	 06SRE, 06collaboration-services, 06Traffic, 06Release-Engineering-Team (Radar): Deploy a TCP proxy across all DCs - https://phabricator.wikimedia.org/T408532#11435160 (10Dzahn) This should conclude the box:   `  Prepare tcpproxy VMs for accepting traffic on the new public IPs `  on the parent task "Move Ge...
[00:06:50] <wikibugs>	 06SRE, 06collaboration-services, 06Traffic, 06Release-Engineering-Team (Radar): Deploy a TCP proxy across all DCs - https://phabricator.wikimedia.org/T408532#11435162 (10Dzahn) 05In progress→03Resolved from here on anything would be just updating 2 tickets at a time. This is done and if there are s...
[00:06:58] <wikibugs>	 (03CR) 10Jasmine: [C:03+2] admin: Add jasmine FIDO ssh key [puppet] - 10https://gerrit.wikimedia.org/r/1213588 (owner: 10Jasmine)
[00:14:41] <jinxer-wm>	 FIRING: [14x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_gerrit-ssh.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[00:18:33] <wikibugs>	 (03CR) 10Aklapper: [V:03+2 C:03+2] Replace "libphutil" with "Arcanist" [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1214708 (owner: 10Pppery)
[00:20:21] <wikibugs>	 (03CR) 10Aklapper: "I think this should also have a line "defaultbranch=wmf/stable"." [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1214702 (owner: 10Pppery)
[00:20:53] <wikibugs>	 (03CR) 10Aklapper: [V:03+2 C:03+2] Remove old list of translated languages [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1214701 (owner: 10Pppery)
[00:26:18] <Amir1>	 !log ladsgroup@deploy2002:~$ mwscript-k8s --follow -- findBadBlobs.php --wiki guwiktionary --mark "Corrupted UTF-8 (T351953)" --revisions 20576
[00:26:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:26:22] <stashbot>	 T351953: Various old revisions are encoded as Windows-1252 rather than UTF-8, causing "RuntimeException: PCRE failure" when viewing them - https://phabricator.wikimedia.org/T351953
[00:26:50] <wikibugs>	 (03CR) 10Aklapper: [V:03+2 C:03+2] Update source strings to latest upstream [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1206983 (owner: 10Pppery)
[00:27:35] <Amir1>	 !log ladsgroup@deploy2002:~$ mwscript-k8s --follow -- findBadBlobs.php --wiki huwikiquote --mark "Corrupted UTF-8 (T351953)" --revisions 3804,3808,3811,3813,3814,3818,3825
[00:27:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:32:54] <wikibugs>	 (03PS1) 10Dzahn: varnish: remove ancient Noise rule from text-frontend VCL [puppet] - 10https://gerrit.wikimedia.org/r/1215329
[00:40:12] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1215332
[00:40:12] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1215332 (owner: 10TrainBranchBot)
[00:43:38] <wikibugs>	 (03CR) 10Pppery: "`track=1` means to apply patches to the remote and branch that your local copy is tracking. So you shouldn't need defaultbranch or default" [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1214702 (owner: 10Pppery)
[00:51:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 19.91% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[00:52:04] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1215332 (owner: 10TrainBranchBot)
[00:56:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:00:52] <logmsgbot>	 !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[01:10:31] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1215339
[01:10:31] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1215339 (owner: 10TrainBranchBot)
[01:13:59] <logmsgbot>	 !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 13m 06s)
[01:15:38] <wikibugs>	 (03CR) 10Aklapper: [V:03+2 C:03+2] "Ah, learned something. :D Thanks!" [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1214702 (owner: 10Pppery)
[01:20:12] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release mw-script/utk6lsuw on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[01:32:14] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1215339 (owner: 10TrainBranchBot)
[01:40:51] <wikibugs>	 (03PS1) 10RLazarus: Update to v1.35.7 [debs/envoyproxy] (v1.35) - 10https://gerrit.wikimedia.org/r/1215349 (https://phabricator.wikimedia.org/T410975)
[01:42:06] <wikibugs>	 (03CR) 10RLazarus: [C:03+2] Update to v1.35.7 [debs/envoyproxy] (v1.35) - 10https://gerrit.wikimedia.org/r/1215349 (https://phabricator.wikimedia.org/T410975) (owner: 10RLazarus)
[01:59:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:01:06] <rzl>	 !log rzl@apt1002:~$ sudo -i reprepro -C component/envoy-future include bullseye-wikimedia /home/rzl/envoyproxy_1.35.7-1_amd64.changes
[02:01:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:10:30] <wikibugs>	 (03PS1) 10RLazarus: envoy-future: Update to v1.35.7 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1215363 (https://phabricator.wikimedia.org/T410975)
[02:11:06] <wikibugs>	 (03PS2) 10RLazarus: envoy-future: Update to v1.35.7 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1215363 (https://phabricator.wikimedia.org/T410975)
[02:11:24] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T410589)', diff saved to https://phabricator.wikimedia.org/P86413 and previous config saved to /var/cache/conftool/dbconfig/20251205-021123-ladsgroup.json
[02:11:28] <stashbot>	 T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[02:12:06] <wikibugs>	 (03CR) 10RLazarus: [V:03+2] "`" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1215363 (https://phabricator.wikimedia.org/T410975) (owner: 10RLazarus)
[02:13:44] <wikibugs>	 (03CR) 10Scott French: [C:03+1] envoy-future: Update to v1.35.7 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1215363 (https://phabricator.wikimedia.org/T410975) (owner: 10RLazarus)
[02:14:17] <wikibugs>	 (03CR) 10RLazarus: [V:03+2 C:03+2] envoy-future: Update to v1.35.7 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1215363 (https://phabricator.wikimedia.org/T410975) (owner: 10RLazarus)
[02:26:32] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P86414 and previous config saved to /var/cache/conftool/dbconfig/20251205-022631-ladsgroup.json
[02:37:07] <wikibugs>	 (03PS1) 10Papaul: Add my FIDO backed production SSH key [puppet] - 10https://gerrit.wikimedia.org/r/1215373 (https://phabricator.wikimedia.org/T411833)
[02:37:52] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Add my FIDO backed production SSH key [puppet] - 10https://gerrit.wikimedia.org/r/1215373 (https://phabricator.wikimedia.org/T411833) (owner: 10Papaul)
[02:41:40] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P86415 and previous config saved to /var/cache/conftool/dbconfig/20251205-024139-ladsgroup.json
[02:44:36] <wikibugs>	 (03PS2) 10Papaul: Add my FIDO backed production SSH key [puppet] - 10https://gerrit.wikimedia.org/r/1215373 (https://phabricator.wikimedia.org/T411833)
[02:55:09] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:55:12] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[02:56:48] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T410589)', diff saved to https://phabricator.wikimedia.org/P86416 and previous config saved to /var/cache/conftool/dbconfig/20251205-025647-ladsgroup.json
[02:56:53] <stashbot>	 T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[02:57:04] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[02:57:12] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1175 (T410589)', diff saved to https://phabricator.wikimedia.org/P86417 and previous config saved to /var/cache/conftool/dbconfig/20251205-025711-ladsgroup.json
[03:09:32] <wikibugs>	 (03CR) 10Dzahn: "can you send me an email to fulfill the requirement for out-of-band verification?" [puppet] - 10https://gerrit.wikimedia.org/r/1215373 (https://phabricator.wikimedia.org/T411833) (owner: 10Papaul)
[03:30:12] <jinxer-wm>	 FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[03:35:38] <wikibugs>	 (03CR) 10CDanis: [C:03+1] trafficserver: add a map for gerrit as a backend [puppet] - 10https://gerrit.wikimedia.org/r/1215317 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn)
[03:40:46] <wikibugs>	 (03PS1) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1215388
[03:40:57] <wikibugs>	 (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1215388 (owner: 10CDanis)
[03:45:38] <wikibugs>	 (03PS1) 10CDanis: WIP2 [puppet] - 10https://gerrit.wikimedia.org/r/1215389
[03:45:47] <wikibugs>	 (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1215389 (owner: 10CDanis)
[03:47:05] <jinxer-wm>	 FIRING: KubernetesCalicoDown: ml-serve1013.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-mlserve&var-instance=ml-serve1013.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[03:50:53] <wikibugs>	 (03PS2) 10CDanis: WIP2 [puppet] - 10https://gerrit.wikimedia.org/r/1215389
[03:50:57] <wikibugs>	 (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1215389 (owner: 10CDanis)
[03:54:12] <wikibugs>	 (03PS3) 10CDanis: WIP2 [puppet] - 10https://gerrit.wikimedia.org/r/1215389
[03:54:15] <wikibugs>	 (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1215389 (owner: 10CDanis)
[03:58:00] <wikibugs>	 (03CR) 10CDanis: "https://puppet-compiler.wmflabs.org/output/1215389/7969/" [puppet] - 10https://gerrit.wikimedia.org/r/1215389 (owner: 10CDanis)
[03:59:04] <wikibugs>	 (03PS2) 10CDanis: lvs7003: add gerrit-ssh [puppet] - 10https://gerrit.wikimedia.org/r/1215388
[03:59:22] <wikibugs>	 (03PS4) 10CDanis: gerrit-ssh: lvs_setup but only in magru [puppet] - 10https://gerrit.wikimedia.org/r/1215389
[03:59:31] <wikibugs>	 (03PS5) 10CDanis: gerrit-ssh: lvs_setup but only in magru [puppet] - 10https://gerrit.wikimedia.org/r/1215389
[04:00:37] <wikibugs>	 (03PS1) 10Pppery: Rename "pt" locale to "pt_PT" so its translations can actually be found [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1215393
[04:01:15] <wikibugs>	 (03PS6) 10CDanis: gerrit-ssh: lvs_setup but only in magru [puppet] - 10https://gerrit.wikimedia.org/r/1215389
[04:01:17] <wikibugs>	 (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1215389 (owner: 10CDanis)
[04:01:39] <wikibugs>	 (03CR) 10Pppery: Rename "pt" locale to "pt_PT" so its translations can actually be found (031 comment) [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1215393 (owner: 10Pppery)
[04:03:25] <wikibugs>	 (03PS7) 10CDanis: gerrit-ssh: lvs_setup but only in magru [puppet] - 10https://gerrit.wikimedia.org/r/1215389
[04:03:26] <wikibugs>	 (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1215389 (owner: 10CDanis)
[04:06:44] <wikibugs>	 (03PS2) 10Pppery: Rename various locales so their translations can actually be found [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1215393
[04:11:30] <wikibugs>	 (03CR) 10Pppery: "(Sources for these codes being correct: https://github.com/phorgeit/arcanist/blob/master/src/internationalization/locales/PhutilCzechLocal" [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1215393 (owner: 10Pppery)
[04:12:59] <wikibugs>	 (03PS8) 10CDanis: gerrit services: lvs_setup! but only in magru. [puppet] - 10https://gerrit.wikimedia.org/r/1215389
[04:13:11] <wikibugs>	 (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1215389 (owner: 10CDanis)
[04:14:56] <jinxer-wm>	 FIRING: [14x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_gerrit-ssh.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[04:21:08] <wikibugs>	 (03PS1) 10CDanis: lvs7001: add gerrit services [puppet] - 10https://gerrit.wikimedia.org/r/1215398
[04:21:28] <wikibugs>	 (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1215398 (owner: 10CDanis)
[04:52:59] <wikibugs>	 (03PS3) 10Pppery: Rename various locales so their translations can actually be found [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1215393
[05:10:01] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:20:12] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release mw-script/utk6lsuw on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[05:35:01] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:59:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:55:09] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:55:12] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[07:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251205T0700)
[07:27:57] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-growthbook-next: apply
[07:28:05] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-growthbook-next: apply
[07:30:12] <jinxer-wm>	 FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[07:31:39] <wikibugs>	 (03PS1) 10Brouberol: growthbook-next: import secret from the right private value files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215450
[07:45:00] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "lgtm, do we also host `donate.wikipedia25.org` in miscweb wikikube?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215225 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn)
[07:47:05] <jinxer-wm>	 FIRING: KubernetesCalicoDown: ml-serve1013.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-mlserve&var-instance=ml-serve1013.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[07:51:56] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on ganeti1039 - https://phabricator.wikimedia.org/T410743#11435605 (10MoritzMuehlenhoff) 05Open→03Resolved Software RAIDs have been rebuilt
[07:59:28] <wikibugs>	 07SRE-Unowned: Update SSH key for kamila - https://phabricator.wikimedia.org/T411404#11435610 (10jcrespo) Updating tags, as there is nothing for the broader team/clinic duty to do, please revert when unblocked.
[08:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251205T0800)
[08:01:46] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] growthbook-next: import secret from the right private value files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215450 (owner: 10Brouberol)
[08:02:14] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-growthbook-next: apply
[08:02:25] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-growthbook-next: apply
[08:02:41] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/ferretdb-growthbook-next: apply
[08:02:59] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/ferretdb-growthbook-next: apply
[08:03:07] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook-next: apply
[08:04:10] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook-next: apply
[08:14:56] <jinxer-wm>	 FIRING: [14x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_gerrit-ssh.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[08:21:42] <wikibugs>	 10SRE-Access-Requests, 13Patch-For-Review: Add FIDO backed production SSH key for Papaul - https://phabricator.wikimedia.org/T411833#11435684 (10Peachey88)
[08:24:02] <wikibugs>	 (03CR) 10Jelto: "two comments in-line. Also as I said in our meeting I'd prefer testing the switchover first on the replica/spare. The cookbook refactoring" [puppet] - 10https://gerrit.wikimedia.org/r/1211549 (https://phabricator.wikimedia.org/T338470) (owner: 10Arnaudb)
[08:36:28] <wikibugs>	 (03PS3) 10Muehlenhoff: Add Guillaume as approver for two more analytics groups [puppet] - 10https://gerrit.wikimedia.org/r/1212168 (https://phabricator.wikimedia.org/T276465)
[08:38:18] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Add Guillaume as approver for two more analytics groups [puppet] - 10https://gerrit.wikimedia.org/r/1212168 (https://phabricator.wikimedia.org/T276465) (owner: 10Muehlenhoff)
[08:39:11] <wikibugs>	 06SRE, 10Infrastructure Security, 06Infrastructure-Foundations, 13Patch-For-Review: puppet admin module: Assign approvers to unix groups - https://phabricator.wikimedia.org/T276465#11435726 (10MoritzMuehlenhoff)
[08:51:05] <wikibugs>	 10ops-eqiad, 06DC-Ops: Wrong disk order on ml-lab1001? - https://phabricator.wikimedia.org/T411753#11435749 (10Jclark-ctr)
[09:08:39] <wikibugs>	 (03PS3) 10Federico Ceratto: clone.py: Upsert instance data in Zarcillo [cookbooks] - 10https://gerrit.wikimedia.org/r/1214083 (https://phabricator.wikimedia.org/T410084)
[09:16:32] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.clone of db1233.eqiad.wmnet onto db1229.eqiad.wmnet
[09:16:36] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.depool db1233 - Depool db1233.eqiad.wmnet to then clone it to db1229.eqiad.wmnet - fceratto@cumin1003
[09:16:54] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db1233 - Depool db1233.eqiad.wmnet to then clone it to db1229.eqiad.wmnet - fceratto@cumin1003
[09:20:12] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release mw-script/utk6lsuw on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[09:58:40] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply
[09:59:17] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply
[09:59:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:02:26] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply
[10:02:50] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply
[10:07:09] <wikibugs>	 (03PS1) 10Muehlenhoff: Stop using puppetmaster2002 for Blackbox smoke tests [puppet] - 10https://gerrit.wikimedia.org/r/1215548 (https://phabricator.wikimedia.org/T365798)
[10:12:06] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove puppetmaster2002 [puppet] - 10https://gerrit.wikimedia.org/r/1215549 (https://phabricator.wikimedia.org/T365798)
[10:15:30] <wikibugs>	 (03CR) 10Elukey: [C:03+1] Stop using puppetmaster2002 for Blackbox smoke tests [puppet] - 10https://gerrit.wikimedia.org/r/1215548 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff)
[10:16:31] <wikibugs>	 (03PS2) 10A smart kitten: SVG: do not allow native SVG rendering [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192528 (https://phabricator.wikimedia.org/T406023) (owner: 10TheDJ)
[10:17:00] <icinga-wm>	 RECOVERY - Host an-worker1148 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[10:17:06] <icinga-wm>	 PROBLEM - SSH on an-worker1148 is CRITICAL: connect to address 10.64.142.2 and port 22: Connection refused https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:17:34] <wikibugs>	 (03CR) 10A smart kitten: "PS2 is a manual rebase to resolve merge conflicts" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1192528 (https://phabricator.wikimedia.org/T406023) (owner: 10TheDJ)
[10:28:26] <icinga-wm>	 PROBLEM - Host an-worker1148 is DOWN: PING CRITICAL - Packet loss = 100%
[10:29:51] <wikibugs>	 (03PS1) 10Btullis: Bump spark image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215556 (https://phabricator.wikimedia.org/T410017)
[10:41:06] <icinga-wm>	 RECOVERY - SSH on an-worker1148 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:41:09] <icinga-wm>	 RECOVERY - Host an-worker1148 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms
[10:41:19] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.pool db1233 gradually with 4 steps - Pool db1233.eqiad.wmnet in after cloning
[10:45:41] <wikibugs>	 (03CR) 10Majavah: [V:03+1 C:03+2] P:kubernetes: deployment_server: Use wmflib::ip2cidr [puppet] - 10https://gerrit.wikimedia.org/r/1214509 (owner: 10Majavah)
[10:49:00] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1148 is OK: OK: optimal, 12 logical, 13 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[10:54:23] <wikibugs>	 (03CR) 10Jelto: "thank you for preparing the patch, it looks good to me however PCC shows a different healthcheck configuration which might return a 302 in" [puppet] - 10https://gerrit.wikimedia.org/r/1215389 (owner: 10CDanis)
[10:55:09] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:55:12] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[11:04:25] <wikibugs>	 (03PS1) 10Cathal Mooney: lvs1018: Remove vlan sub-interfaces [puppet] - 10https://gerrit.wikimedia.org/r/1215565 (https://phabricator.wikimedia.org/T411781)
[11:26:48] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1233 gradually with 4 steps - Pool db1233.eqiad.wmnet in after cloning
[11:28:58] <Lucas_WMDE>	 does anyone know if we need to do something to fix a cronjob that failed due to the service mesh being unavailable? T411862
[11:28:59] <stashbot>	 T411862: MediaWiki periodic job wikidata-resubmit-changes-for-dispatch failed - https://phabricator.wikimedia.org/T411862
[11:30:12] <jinxer-wm>	 FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[11:31:14] <Lucas_WMDE>	 do we need to manually delete the failed job so that it’ll resume running periodically?
[11:31:30] * Lucas_WMDE may try that later but leaves some time for someone else to chime in first
[11:39:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[11:40:55] <wikibugs>	 (03PS2) 10Muehlenhoff: Only select Puppet version based on the Debian release [puppet] - 10https://gerrit.wikimedia.org/r/1214564 (https://phabricator.wikimedia.org/T365798)
[11:42:03] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1233.eqiad.wmnet onto db1229.eqiad.wmnet
[11:42:16] <logmsgbot>	 kubectl delete job wikidata-resubmit-changes-for-dispatch-29415459 # T411862
[11:42:17] <stashbot>	 T411862: MediaWiki periodic job wikidata-resubmit-changes-for-dispatch failed - https://phabricator.wikimedia.org/T411862
[11:42:33] <Lucas_WMDE>	 oops, that doesn’t include the user name or host name by default? TIL
[11:42:43] <Lucas_WMDE>	 wait, it doesn’t even include !log by default?
[11:42:55] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 kubectl delete job wikidata-resubmit-changes-for-dispatch-29415459 # T411862
[11:43:08] <Lucas_WMDE>	 ok that seems to have worked better
[11:43:18] <logmsgbot>	 --help
[11:43:22] <Lucas_WMDE>	 yeah I figured
[11:43:49] <Lucas_WMDE>	 so dologmsg is just a straight pipe to IRC with no bells and whistles. good to know
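Editorial sketch of the point made above: since dologmsg forwards its input to IRC unchanged, the caller has to add the `!log user@host` prefix. The helper below only builds that string. Its name, `format_log_line`, and its parameters are hypothetical, and how the result is passed to dologmsg is left out because the log does not show that interface.

```python
import getpass
import socket
from typing import Optional


def format_log_line(message: str,
                    user: Optional[str] = None,
                    host: Optional[str] = None) -> str:
    """Build a '!log user@host message' line (hypothetical helper).

    Falls back to the local user name and short host name when the
    caller does not supply them.
    """
    user = user or getpass.getuser()
    # Keep only the short host name, e.g. 'deploy2002' from an FQDN.
    host = host or socket.gethostname().split(".")[0]
    return f"!log {user}@{host} {message}"


if __name__ == "__main__":
    print(format_log_line(
        "kubectl delete job wikidata-resubmit-changes-for-dispatch-29415459 # T411862",
        user="lucaswerkmeister-wmde",
        host="deploy2002",
    ))
```

With the user and host given explicitly, the printed line matches the corrected message logged at 11:42:55.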
[11:44:15] <jinxer-wm>	 RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[11:45:54] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1214564 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff)
[11:46:01] <xSavitar>	 Lucas_WMDE, per https://phabricator.wikimedia.org/search/query/tXtQf80._Cf8/#R, we seem to have a handful of them recently. I was digging into this recently too and there is a useful conversation at T410764.
[11:46:01] <stashbot>	 T410764: MediaWiki periodic job startupregistrystats-mediawikiwiki failed - https://phabricator.wikimedia.org/T410764
[11:46:20] <xSavitar>	 It seems like the service affected just needs a restart but I'm no expert in that area
[11:46:50] <Lucas_WMDE>	 xSavitar: thanks, good to know
[11:46:52] <xSavitar>	 I think SRE/ServiceOps may have some ideas.
[11:47:05] <jinxer-wm>	 FIRING: KubernetesCalicoDown: ml-serve1013.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-mlserve&var-instance=ml-serve1013.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[11:47:07] <Lucas_WMDE>	 I don’t know why phaultfinder didn’t create a task for the alert I encountered (AFAICT)
[11:47:12] <xSavitar>	 Ack! But I agree with you, it's happening quite a bit more frequently than normal recently.
[11:51:51] <wikibugs>	 (PS3) Muehlenhoff: Only select Puppet version based on the Debian release [puppet] - https://gerrit.wikimedia.org/r/1214564 (https://phabricator.wikimedia.org/T365798)
[11:52:45] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply
[11:53:05] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply
[11:54:58] <xSavitar>	 Lucas_WMDE, I don't know why an alert was not created for that either. Maybe something about wikidata needs to be added to https://gerrit.wikimedia.org/r/plugins/gitiles/operations/alerts/%2B/refs/heads/master
[11:55:14] <xSavitar>	 I can see various team sub-dirs there
[11:55:37] <wikibugs>	 (PS1) Btullis: Bump the mediawiki image used in the mediawiki-dumps-legacy-toolbox [deployment-charts] - https://gerrit.wikimedia.org/r/1215576 (https://phabricator.wikimedia.org/T405955)
[11:55:47] <xSavitar>	 https://gerrit.wikimedia.org/g/operations/alerts/+/dd2e90219cec0e4605ba0025bd496e01981f4603/team-sre/mw-cron.yaml
[11:56:06] <Lucas_WMDE>	 hm, no idea
[11:56:12] <Lucas_WMDE>	 I mean, the alert itself existed, I could see it on alerts.w.o
[11:56:18] <Lucas_WMDE>	 it just didn’t make a phab task
[11:56:38] <Lucas_WMDE>	 right, that’s probably where it came from, mw-cron.yaml
[11:57:53] <wikibugs>	 (CR) Zoe: [C:+1] "Yup, new key is working!" [puppet] - https://gerrit.wikimedia.org/r/1215273 (https://phabricator.wikimedia.org/T411506) (owner: Andrea Denisse)
[12:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251205T0800)
[12:00:05] <jouncebot>	 jelto, arnoldokoth, mutante, and arnaudb: GitLab version upgrades (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251205T1200). Please do the needful.
[12:00:57] <xSavitar>	 Lucas_WMDE, maybe you want to file a task
[12:05:34] * Lucas_WMDE tried to improve some docs at https://wikitech.wikimedia.org/wiki/dologmsg
[12:05:52] <Lucas_WMDE>	 xSavitar: eh, I’m okay with leaving it alone for now. maybe if it happens again
[12:05:55] <Lucas_WMDE>	 but thanks!
[12:08:02] <Lucas_WMDE>	 (also, I couldn’t find a general wikitech documentation page for !log usage in production… but maybe I missed it)
[12:10:02] <Lucas_WMDE>	 interestingly enough, the documentation suggests that at some point, `dologmsg` did add the !log prefix somewhere in the pipeline: https://gerrit.wikimedia.org/g/operations/puppet/+/63a8174ffd/modules/scap/files/manpages/asciidoc/dologmsg.txt#37
[12:10:13] <Lucas_WMDE>	 that, or the documentation was always misleading ^^
[12:14:56] <jinxer-wm>	 FIRING: [14x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_gerrit-ssh.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[12:17:48] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Infrastructure, Documentation, Puppet (Puppet 7.0): Puppet7: Update documentation - https://phabricator.wikimedia.org/T341095#11436234 (jcrespo) I would like to mention in particular workflows like renewal/revoking of certificates on server workflows, pa...
[12:30:01] <jinxer-wm>	 RESOLVED: HelmReleaseBadStatus: Helm release mw-script/utk6lsuw on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[12:30:21] <jayme>	 !log removed helm release mw-script/utk6lsuw in k8s@codfw which was stuck in pending-install state for 9+ days
[12:30:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:42:25] <moritzm>	 !log upgrade python3-sshpubkeys on idm-test1001 to 3.3.1-1~wmf12u1 T411816
[12:42:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:42:29] <stashbot>	 T411816: cannot add a FIDO-backed ssh key to Bitu - https://phabricator.wikimedia.org/T411816
[12:53:06] <wikibugs>	 SRE, Infrastructure-Foundations, netops: rancid: message has lines too long for transport - https://phabricator.wikimedia.org/T410606#11436260 (cmooney) Resolved→Open Thanks for the work on this @MoritzMuehlenhoff!  From what I can see we still have a small number of these mails coming throug...
[13:04:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:05:28] <wikibugs>	 (CR) Federico Ceratto: "Tested in T411805" [cookbooks] - https://gerrit.wikimedia.org/r/1215116 (https://phabricator.wikimedia.org/T391581) (owner: Federico Ceratto)
[13:10:54] <moritzm>	 !log upload python3-sshpubkeys 3.3.1-1~wmf12u1 to apt.wikimedia.org T411816
[13:10:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:10:57] <stashbot>	 T411816: cannot add a FIDO-backed ssh key to Bitu - https://phabricator.wikimedia.org/T411816
[13:22:28] <wikibugs>	 (CR) Jforrester: [C:+1] Bump the mediawiki image used in the mediawiki-dumps-legacy-toolbox [deployment-charts] - https://gerrit.wikimedia.org/r/1215576 (https://phabricator.wikimedia.org/T405955) (owner: Btullis)
[13:22:34] <wikibugs>	 (CR) Elukey: Only select Puppet version based on the Debian release (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1214564 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[13:25:17] <wikibugs>	 (CR) Muehlenhoff: Only select Puppet version based on the Debian release (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1214564 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[13:26:43] <wikibugs>	 (CR) Btullis: [C:+2] Bump the mediawiki image used in the mediawiki-dumps-legacy-toolbox [deployment-charts] - https://gerrit.wikimedia.org/r/1215576 (https://phabricator.wikimedia.org/T405955) (owner: Btullis)
[13:26:45] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet CI, Puppet-Infrastructure, Patch-For-Review: Default to the Puppet 7 PCC CI test, make it voting and eventually remove the Puppet 5 one - https://phabricator.wikimedia.org/T367399#11436294 (taavi)
[13:28:23] <wikibugs>	 (Merged) jenkins-bot: Bump the mediawiki image used in the mediawiki-dumps-legacy-toolbox [deployment-charts] - https://gerrit.wikimedia.org/r/1215576 (https://phabricator.wikimedia.org/T405955) (owner: Btullis)
[13:30:33] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply
[13:31:51] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - cr3-ulsfo:xe-0/1/1 (Transport: cr2-eqord:xe-0/1/3 (Arelion, IC-313592 51ms 10Gbps wave) {#1062}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[13:32:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at codfw: 20.41% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[13:33:00] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply
[13:33:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[13:36:40] <wikibugs>	 (PS1) Federico Ceratto: db1229: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1215599 (https://phabricator.wikimedia.org/T411652)
[13:36:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[13:38:15] <jinxer-wm>	 FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[13:41:39] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-growthbook: apply
[13:42:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at codfw: 14.56% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[13:43:12] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/ferretdb-growthbook: apply
[13:43:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[13:43:35] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/ferretdb-growthbook: apply
[13:46:29] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-growthbook-next: apply
[13:46:30] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-growthbook-next: apply
[13:46:40] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-growthbook-next: apply
[13:47:01] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-growthbook-next: apply
[13:49:36] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/ferretdb-growthbook-next: apply
[13:49:47] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/ferretdb-growthbook-next: apply
[13:49:54] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:50:25] <jayme>	 now that is annoying...
[13:50:44] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook-next: apply
[13:51:23] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthboo-next: apply
[13:52:29] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook-next: apply
[13:52:54] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthboo-next: apply
[14:02:19] <wikibugs>	 (PS1) Daniel Kinzler: rest gateway: add smoke tests [WIP] [deployment-charts] - https://gerrit.wikimedia.org/r/1215605
[14:02:28] <wikibugs>	 (CR) CI reject: [V:-1] rest gateway: add smoke tests [WIP] [deployment-charts] - https://gerrit.wikimedia.org/r/1215605 (owner: Daniel Kinzler)
[14:03:32] <wikibugs>	 (CR) Jelto: [C:+2] vrts: add high inode usage alert [alerts] - https://gerrit.wikimedia.org/r/1214034 (https://phabricator.wikimedia.org/T411452) (owner: AOkoth)
[14:04:07] <wikibugs>	 (CR) Elukey: services: add maps-next.w.o as FQDN for kartotherian staging (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1215098 (https://phabricator.wikimedia.org/T409528) (owner: Elukey)
[14:05:06] <wikibugs>	 (Merged) jenkins-bot: vrts: add high inode usage alert [alerts] - https://gerrit.wikimedia.org/r/1214034 (https://phabricator.wikimedia.org/T411452) (owner: AOkoth)
[14:08:39] <jayme>	 !log stopped puppet on wikikube-ctrl2* and restarted kube-apiserver to temporarily extend audit logging
[14:08:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:11:49] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply
[14:12:16] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply
[14:16:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:21:51] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T410589)', diff saved to https://phabricator.wikimedia.org/P86425 and previous config saved to /var/cache/conftool/dbconfig/20251205-142150-ladsgroup.json
[14:21:54] <stashbot>	 T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[14:22:54] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-growthbook: apply
[14:23:04] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-growthbook: apply
[14:25:36] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/ferretdb-growthbook: apply
[14:25:52] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/ferretdb-growthbook: apply
[14:26:27] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply
[14:27:30] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply
[14:33:53] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:36:59] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P86426 and previous config saved to /var/cache/conftool/dbconfig/20251205-143658-ladsgroup.json
[14:39:42] <wikibugs>	 (PS2) Btullis: Bump spark image [deployment-charts] - https://gerrit.wikimedia.org/r/1215556 (https://phabricator.wikimedia.org/T410017)
[14:41:45] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good and verified out of band" [puppet] - https://gerrit.wikimedia.org/r/1215373 (https://phabricator.wikimedia.org/T411833) (owner: Papaul)
[14:42:36] <wikibugs>	 (CR) Btullis: [C:+2] Bump spark image [deployment-charts] - https://gerrit.wikimedia.org/r/1215556 (https://phabricator.wikimedia.org/T410017) (owner: Btullis)
[14:44:05] <wikibugs>	 SRE, collaboration-services, vrts, Znuny, Wikimedia-Incident: No space left on device on VRTS host - https://phabricator.wikimedia.org/T411452#11436505 (Jelto) Open→Resolved a:Arnoldokoth Thanks @Arnoldokoth for enabling the cleanup job again. The [inode metrics](https://grafa...
[14:44:39] <wikibugs>	 (Merged) jenkins-bot: Bump spark image [deployment-charts] - https://gerrit.wikimedia.org/r/1215556 (https://phabricator.wikimedia.org/T410017) (owner: Btullis)
[14:45:57] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/analytics-test: apply
[14:46:05] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/analytics-test: apply
[14:50:40] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook-next: apply
[14:51:02] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthboo-next: apply
[14:51:22] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:51:55] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply
[14:52:06] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P86427 and previous config saved to /var/cache/conftool/dbconfig/20251205-145206-ladsgroup.json
[14:52:20] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply
[14:55:12] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[14:55:30] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook-next: apply
[14:55:30] <wikibugs>	 (CR) Elukey: [C:+1] Only select Puppet version based on the Debian release [puppet] - https://gerrit.wikimedia.org/r/1214564 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[14:56:18] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthboo-next: apply
[14:57:01] <wikibugs>	 (CR) Ladsgroup: [C:+1] "Icinga is green" [puppet] - https://gerrit.wikimedia.org/r/1215599 (https://phabricator.wikimedia.org/T411652) (owner: Federico Ceratto)
[15:02:12] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:02:12] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:07:14] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T410589)', diff saved to https://phabricator.wikimedia.org/P86428 and previous config saved to /var/cache/conftool/dbconfig/20251205-150713-ladsgroup.json
[15:07:17] <stashbot>	 T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[15:07:30] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1189.eqiad.wmnet with reason: Maintenance
[15:07:39] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1189 (T410589)', diff saved to https://phabricator.wikimedia.org/P86429 and previous config saved to /var/cache/conftool/dbconfig/20251205-150737-ladsgroup.json
[15:08:12] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 03 Feb 2026 07:30:03 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:10:01] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:11:11] <wikibugs>	 (PS1) Superpes15: [enwikibooks] Allow sysops to revert abusefilter and grant/revoke confirmed and accountcreator flags [mediawiki-config] - https://gerrit.wikimedia.org/r/1215625 (https://phabricator.wikimedia.org/T411828)
[15:12:02] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 55267 bytes in 0.077 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:12:02] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.182 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:27:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at codfw: 16.2% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[15:29:15] <jinxer-wm>	 FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[15:30:12] <jinxer-wm>	 FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[15:30:55] <Amir1>	 !log creating ores tables on thwiki (T409438)
[15:30:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:30:59] <stashbot>	 T409438: Enable revertrisk filters in thwiki - https://phabricator.wikimedia.org/T409438
[15:32:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at codfw: 24.81% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[15:34:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[15:34:23] <wikibugs>	 (PS1) Btullis: Remove incorrect hive.server2 settings and correct the k8s URL [deployment-charts] - https://gerrit.wikimedia.org/r/1215627 (https://phabricator.wikimedia.org/T406833)
[15:35:01] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:36:46] <wikibugs>	 (CR) Federico Ceratto: [C:+2] db1229: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1215599 (https://phabricator.wikimedia.org/T411652) (owner: Federico Ceratto)
[15:38:53] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:42:02] <wikibugs>	 (CR) Btullis: [C:+2] Remove incorrect hive.server2 settings and correct the k8s URL [deployment-charts] - https://gerrit.wikimedia.org/r/1215627 (https://phabricator.wikimedia.org/T406833) (owner: Btullis)
[15:44:08] <wikibugs>	 (Merged) jenkins-bot: Remove incorrect hive.server2 settings and correct the k8s URL [deployment-charts] - https://gerrit.wikimedia.org/r/1215627 (https://phabricator.wikimedia.org/T406833) (owner: Btullis)
[15:47:05] <jinxer-wm>	 FIRING: KubernetesCalicoDown: ml-serve1013.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-mlserve&var-instance=ml-serve1013.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[15:50:29] <wikibugs>	 (CR) Papaul: [C:+2] "Email sent" [puppet] - https://gerrit.wikimedia.org/r/1215373 (https://phabricator.wikimedia.org/T411833) (owner: Papaul)
[15:58:23] <wikibugs>	 (Abandoned) Hashar: gerrit: add a layer of CNAME to ease switch overs [dns] - https://gerrit.wikimedia.org/r/1210560 (https://phabricator.wikimedia.org/T387833) (owner: Hashar)
[16:02:16] <wikibugs>	 SRE, collaboration-services, Patch-For-Review, PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11436743 (Dzahn)
[16:03:18] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/analytics-test: apply
[16:03:25] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/analytics-test: apply
[16:06:15] <wikibugs>	 (CR) Dzahn: [C:+1] "got the email but see it's already resolved:)" [puppet] - https://gerrit.wikimedia.org/r/1215373 (https://phabricator.wikimedia.org/T411833) (owner: Papaul)
[16:11:51] <jinxer-wm>	 RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[16:13:40] <wikibugs>	 (CR) JHathaway: [C:+1] Only select Puppet version based on the Debian release [puppet] - https://gerrit.wikimedia.org/r/1214564 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[16:13:48] <wikibugs>	 SRE, collaboration-services, Patch-For-Review, PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11436784 (Dzahn) >>! In T408592#11386273, @ATitkov wrote: > Forgot to add the current repo [[ https://gitlab.wikimedia.org/toolforge-repos/...
[16:14:56] <jinxer-wm>	 FIRING: [14x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_gerrit-ssh.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[16:17:47] <wikibugs>	 (CR) Dzahn: "per discussion today: at least initially we are not going to move donate.wikipedia25.org - it will stay in the ncredir cluster with the ex" [deployment-charts] - https://gerrit.wikimedia.org/r/1215225 (https://phabricator.wikimedia.org/T408592) (owner: Dzahn)
[16:18:34] <wikibugs>	 (CR) Dzahn: [C:+1] "if you feel like deploying this, go ahead. if not I will get back to it Tuesday. or we can do it in the Tuesday meeting." [deployment-charts] - https://gerrit.wikimedia.org/r/1215225 (https://phabricator.wikimedia.org/T408592) (owner: Dzahn)
[16:19:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:22:13] <wikibugs>	 (CR) Dzahn: [C:+2] trafficserver: add a map for gerrit as a backend [puppet] - https://gerrit.wikimedia.org/r/1215317 (https://phabricator.wikimedia.org/T365259) (owner: Dzahn)
[16:25:40] <wikibugs>	 (CR) Dzahn: [C:+1] gerrit services: lvs_setup! but only in magru. [puppet] - https://gerrit.wikimedia.org/r/1215389 (owner: CDanis)
[16:39:54] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:48:59] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Wrong disk order on ml-lab1001? - https://phabricator.wikimedia.org/T411753#11436870 (Jclark-ctr) Open→Resolved a:Jclark-ctr Removed 2x drives [4:0:0:0] disk ATA Micron_5400_MTFD U002 /dev/sda [5:0:0:0] disk ATA Micron_5400_MTFD U002 /dev/sdc
[16:50:33] <wikibugs>	 (PS3) CDanis: lvs7003: add gerrit-ssh [puppet] - https://gerrit.wikimedia.org/r/1215388
[16:50:33] <wikibugs>	 (PS9) CDanis: gerrit services: lvs_setup! but only in magru. [puppet] - https://gerrit.wikimedia.org/r/1215389
[16:50:33] <wikibugs>	 (PS2) CDanis: lvs7001: add gerrit services [puppet] - https://gerrit.wikimedia.org/r/1215398
[16:50:48] <wikibugs>	 (CR) CDanis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1215389 (owner: CDanis)
[16:50:51] <wikibugs>	 (CR) CDanis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1215398 (owner: CDanis)
[16:51:11] <wikibugs>	 (CR) CDanis: "Yes, you are very right, and apologies for forgetting this after the first time you pointed it out." [puppet] - https://gerrit.wikimedia.org/r/1215389 (owner: CDanis)
[16:52:05] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-lab1001.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[16:56:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[17:01:54] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[17:02:20] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-lab1001.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[17:02:42] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-lab1001.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[17:07:07] <logmsgbot>	 !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.provision (exit_code=97) for host ml-lab1001.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[17:10:03] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-lab1001.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[17:10:16] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-lab1001.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[17:10:49] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-lab1001.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[17:11:02] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-lab1001.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART
[17:12:23] <wikibugs>	 (03CR) 10Majavah: [C:03+1] P:toolforge:prometheus: scrape mariadb metrics [puppet] - 10https://gerrit.wikimedia.org/r/1215121 (https://phabricator.wikimedia.org/T410505) (owner: 10FNegri)
[17:14:23] <wikibugs>	 (03CR) 10Elukey: ml-build: define new machine name/type (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1213972 (https://phabricator.wikimedia.org/T394778) (owner: 10Dpogorzelski)
[17:18:18] <wikibugs>	 (03CR) 10Dzahn: "yea, in our blackbox monitoring checks we also accept "200 OR 302"" [puppet] - 10https://gerrit.wikimedia.org/r/1215389 (owner: 10CDanis)
[17:18:22] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] gerrit services: lvs_setup! but only in magru. [puppet] - 10https://gerrit.wikimedia.org/r/1215389 (owner: 10CDanis)
[17:23:35] <topranks>	 !log add updated ssh firewall filter config to pfw1-eqiad.wikimedia.org T390939
[17:23:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:25:18] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Inbound errors on interface ssw1-e1-eqiad:xe-0/0/32 (Transport: lvs1020:enp94s0f0np0 (Equinix, 21996479) {#21989994}) - https://phabricator.wikimedia.org/T411818#11436967 (10Jclark-ctr) @cmooney I might have closed out T411684 prematurely. I had noticed the spike occurred at the...
[17:27:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at codfw: 9.539% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[17:28:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[17:29:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[17:32:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at codfw: 23.37% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[17:33:15] <jinxer-wm>	 FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[17:38:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[17:40:19] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Inbound errors on interface ssw1-e1-eqiad:xe-0/0/32 (Transport: lvs1020:enp94s0f0np0 (Equinix, 21996479) {#21989994}) - https://phabricator.wikimedia.org/T411818#11437045 (10Jclark-ctr) a:03Jclark-ctr
[17:42:12] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to <ENTER RESOURCE NAME> for <ENTER YOUR USERNAME> - https://phabricator.wikimedia.org/T411883 (10Leif_WMDE) 03NEW
[17:42:56] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to <ENTER RESOURCE NAME> for <ENTER YOUR USERNAME> - https://phabricator.wikimedia.org/T411883#11437057 (10Leif_WMDE) a:03Lena_WMDE
[17:44:45] <jinxer-wm>	 FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[17:45:00] <jinxer-wm>	 FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[17:49:45] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[17:51:59] <wikibugs>	 (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215663
[18:02:10] <wikibugs>	 (03PS1) 10Btullis: Add an extra egress rule for airflow-main to allow uploading to frack s3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215665 (https://phabricator.wikimedia.org/T411740)
[18:06:18] <wikibugs>	 (03Abandoned) 10Jforrester: Wikifunctions SLO: Adjust upper bucket to 10.1s to cover slow reporting [puppet] - 10https://gerrit.wikimedia.org/r/1192609 (https://phabricator.wikimedia.org/T394057) (owner: 10Jforrester)
[18:06:52] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Add an extra egress rule for airflow-main to allow uploading to frack s3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215665 (https://phabricator.wikimedia.org/T411740) (owner: 10Btullis)
[18:08:55] <wikibugs>	 (03Merged) 10jenkins-bot: Add an extra egress rule for airflow-main to allow uploading to frack s3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215665 (https://phabricator.wikimedia.org/T411740) (owner: 10Btullis)
[18:10:34] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply
[18:10:52] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply
[18:16:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:17:58] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook-next: apply
[18:18:18] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook-next: apply
[18:21:20] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users and SQL Lab for Leif WMDE - https://phabricator.wikimedia.org/T411883#11437166 (10Pppery)
[18:21:47] <wikibugs>	 (03PS1) 10Btullis: Add a certificate for the frtech root CA to airflow-main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215669 (https://phabricator.wikimedia.org/T411740)
[18:22:55] <wikibugs>	 (03CR) 10Aleksandar Mastilovic: [C:03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215665 (https://phabricator.wikimedia.org/T411740) (owner: 10Btullis)
[18:27:20] <wikibugs>	 (03PS2) 10Btullis: Add a certificate for the frtech root CA to airflow-main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215669 (https://phabricator.wikimedia.org/T411740)
[18:27:59] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply
[18:28:31] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply
[18:34:06] <wikibugs>	 (03PS1) 10Jelto: gerrit: allow https traffic to both interfaces [puppet] - 10https://gerrit.wikimedia.org/r/1215673 (https://phabricator.wikimedia.org/T365259)
[18:36:22] <wikibugs>	 (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7797/co" [puppet] - 10https://gerrit.wikimedia.org/r/1215673 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto)
[18:36:37] <wikibugs>	 (03CR) 10Dzahn: "oh yes, absolutely. good point/catch!" [puppet] - 10https://gerrit.wikimedia.org/r/1215673 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto)
[18:38:44] <wikibugs>	 (03CR) 10CDanis: [C:03+1] "thanks!!" [puppet] - 10https://gerrit.wikimedia.org/r/1215673 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto)
[18:40:44] <wikibugs>	 (03PS3) 10Btullis: Add a certificate and an S3 connection to airflow-main for frtech [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215669 (https://phabricator.wikimedia.org/T411740)
[18:48:35] <wikibugs>	 (03PS4) 10Btullis: Add a certificate and an S3 connection to airflow-main for frtech [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215669 (https://phabricator.wikimedia.org/T411740)
[18:51:34] <jinxer-wm>	 FIRING: DiskSpace: Disk space serpens:9100:/ 3.153% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=serpens - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[18:52:08] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Add a certificate and an S3 connection to airflow-main for frtech [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215669 (https://phabricator.wikimedia.org/T411740) (owner: 10Btullis)
[18:54:10] <wikibugs>	 (03Merged) 10jenkins-bot: Add a certificate and an S3 connection to airflow-main for frtech [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215669 (https://phabricator.wikimedia.org/T411740) (owner: 10Btullis)
[18:55:12] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[18:56:25] <wikibugs>	 (03CR) 10Dzahn: [V:03+1 C:03+1] "https://puppet-compiler.wmflabs.org/output/1215673/7798/" [puppet] - 10https://gerrit.wikimedia.org/r/1215673 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto)
[18:56:59] <wikibugs>	 (03CR) 10CDanis: [C:03+2] gerrit: allow https traffic to both interfaces [puppet] - 10https://gerrit.wikimedia.org/r/1215673 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto)
[19:09:36] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply
[19:10:14] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply
[19:22:36] <wikibugs>	 (03PS2) 10Dzahn: admin: Remove unused SSH key for Zoe [puppet] - 10https://gerrit.wikimedia.org/r/1215273 (https://phabricator.wikimedia.org/T411506) (owner: 10Andrea Denisse)
[19:22:38] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] admin: Remove unused SSH key for Zoe [puppet] - 10https://gerrit.wikimedia.org/r/1215273 (https://phabricator.wikimedia.org/T411506) (owner: 10Andrea Denisse)
[19:30:12] <jinxer-wm>	 FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[19:31:05] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1215273 (https://phabricator.wikimedia.org/T411506) (owner: 10Andrea Denisse)
[19:33:28] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting update of SSH key for zoe - https://phabricator.wikimedia.org/T411506#11437363 (10Dzahn) I think this is now resolved.
[19:34:32] <wikibugs>	 06SRE, 10SRE-Access-Requests: Add FIDO backed production SSH key for Papaul - https://phabricator.wikimedia.org/T411833#11437365 (10Dzahn) This seems to be resolved now.
[19:38:35] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users and SQL Lab for Leif WMDE - https://phabricator.wikimedia.org/T411883#11437372 (10Dzahn) Hello @Leif_WMDE   you can kick-off the process early by sending an email to [[ https://meta.wikimedia.org/wiki/User:KFrancis_(WMF) | Katie Fra...
[19:39:05] <wikibugs>	 (03PS1) 10CDanis: ats: gerrit: don't validate TLS host for now [puppet] - 10https://gerrit.wikimedia.org/r/1215684
[19:39:27] <wikibugs>	 (03PS2) 10CDanis: ats: gerrit: don't validate TLS host for now [puppet] - 10https://gerrit.wikimedia.org/r/1215684
[19:39:41] <wikibugs>	 (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1215684 (owner: 10CDanis)
[19:41:34] <jinxer-wm>	 RESOLVED: DiskSpace: Disk space serpens:9100:/ 0.5635% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=serpens - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[19:43:50] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] "not claiming to review the actual ATS config but the problem is clear and a blocker and since this only touches the gerrit map that has ju" [puppet] - 10https://gerrit.wikimedia.org/r/1215684 (owner: 10CDanis)
[19:44:47] <wikibugs>	 (03PS1) 10Btullis: Revert "Add a certificate and an S3 connection to airflow-main for frtech" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215686
[19:45:41] <wikibugs>	 (03CR) 10Aleksandar Mastilovic: [C:03+1] Revert "Add a certificate and an S3 connection to airflow-main for frtech" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215686 (owner: 10Btullis)
[19:46:12] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Revert "Add a certificate and an S3 connection to airflow-main for frtech" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215686 (owner: 10Btullis)
[19:46:20] <wikibugs>	 (03CR) 10Btullis: [V:03+2 C:03+2] Revert "Add a certificate and an S3 connection to airflow-main for frtech" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215686 (owner: 10Btullis)
[19:47:05] <jinxer-wm>	 FIRING: KubernetesCalicoDown: ml-serve1013.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-mlserve&var-instance=ml-serve1013.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[19:48:22] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Add a certificate and an S3 connection to airflow-main for frtech" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215686 (owner: 10Btullis)
[19:49:33] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply
[19:50:24] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply
[20:10:25] <wikibugs>	 (03PS1) 10Scott French: shellbox-constraints: bump replicas to 20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215672
[20:10:28] <wikibugs>	 (03CR) 10Scott French: [C:03+2] shellbox-constraints: bump replicas to 20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215672 (owner: 10Scott French)
[20:12:42] <wikibugs>	 (03Merged) 10jenkins-bot: shellbox-constraints: bump replicas to 20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215672 (owner: 10Scott French)
[20:14:56] <jinxer-wm>	 FIRING: [14x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_gerrit-ssh.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[20:16:43] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply
[20:17:11] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply
[20:18:13] <wikibugs>	 (03PS3) 10CDanis: ats: gerrit: don't validate TLS host for now [puppet] - 10https://gerrit.wikimedia.org/r/1215684
[20:18:14] <wikibugs>	 (03PS4) 10CDanis: lvs7003: add gerrit-ssh [puppet] - 10https://gerrit.wikimedia.org/r/1215388
[20:18:14] <wikibugs>	 (03PS10) 10CDanis: gerrit services: lvs_setup! but only in magru. [puppet] - 10https://gerrit.wikimedia.org/r/1215389
[20:18:14] <wikibugs>	 (03PS3) 10CDanis: lvs7001: add gerrit services [puppet] - 10https://gerrit.wikimedia.org/r/1215398
[20:21:41] <wikibugs>	 (03PS4) 10CDanis: ats: gerrit: don't validate TLS host for now [puppet] - 10https://gerrit.wikimedia.org/r/1215684 (https://phabricator.wikimedia.org/T411895)
[20:21:43] <wikibugs>	 (03PS5) 10CDanis: lvs7003: add gerrit-ssh [puppet] - 10https://gerrit.wikimedia.org/r/1215388 (https://phabricator.wikimedia.org/T411895)
[20:21:45] <wikibugs>	 (03PS11) 10CDanis: gerrit services: lvs_setup! but only in magru. [puppet] - 10https://gerrit.wikimedia.org/r/1215389 (https://phabricator.wikimedia.org/T411895)
[20:21:47] <wikibugs>	 (03PS4) 10CDanis: lvs7001: add gerrit services [puppet] - 10https://gerrit.wikimedia.org/r/1215398 (https://phabricator.wikimedia.org/T411895)
[20:21:49] <wikibugs>	 (03PS1) 10CDanis: gerrit/Liberica: expand to drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1215693 (https://phabricator.wikimedia.org/T411895)
[20:38:47] <wikibugs>	 (03PS1) 10Dzahn: add gerrit-ssh and gerrit-https to liberica services on lvs7003 [puppet] - 10https://gerrit.wikimedia.org/r/1215699 (https://phabricator.wikimedia.org/T411895)
[20:40:37] <wikibugs>	 (03PS6) 10CDanis: lvs7003: add gerrit-ssh and gerrit-https [puppet] - 10https://gerrit.wikimedia.org/r/1215388 (https://phabricator.wikimedia.org/T411895)
[20:40:37] <wikibugs>	 (03PS12) 10CDanis: gerrit services: lvs_setup! but only in magru. [puppet] - 10https://gerrit.wikimedia.org/r/1215389 (https://phabricator.wikimedia.org/T411895)
[20:40:37] <wikibugs>	 (03PS5) 10CDanis: lvs7001: add gerrit services [puppet] - 10https://gerrit.wikimedia.org/r/1215398 (https://phabricator.wikimedia.org/T411895)
[20:40:38] <wikibugs>	 (03PS2) 10CDanis: gerrit/Liberica: expand to drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1215693 (https://phabricator.wikimedia.org/T411895)
[20:40:55] <wikibugs>	 (03Abandoned) 10Dzahn: add gerrit-ssh and gerrit-https to liberica services on lvs7003 [puppet] - 10https://gerrit.wikimedia.org/r/1215699 (https://phabricator.wikimedia.org/T411895) (owner: 10Dzahn)
[21:03:05] <logmsgbot>	 !log tchin@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply
[21:03:19] <logmsgbot>	 !log tchin@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply
[21:04:57] <wikibugs>	 (03CR) 10Dzahn: "in the future this should be reverted for https://phabricator.wikimedia.org/T411904" [puppet] - 10https://gerrit.wikimedia.org/r/1215684 (https://phabricator.wikimedia.org/T411895) (owner: 10CDanis)
[21:14:00] <wikibugs>	 (03PS1) 10Dzahn: switch gerrit service IP to CDN [dns] - 10https://gerrit.wikimedia.org/r/1215709 (https://phabricator.wikimedia.org/T411895)
[21:16:24] <wikibugs>	 (03CR) 10Dzahn: "not quite yet but not far away" [dns] - 10https://gerrit.wikimedia.org/r/1215709 (https://phabricator.wikimedia.org/T411895) (owner: 10Dzahn)
[21:18:55] <logmsgbot>	 !log xcollazo@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply
[21:19:26] <logmsgbot>	 !log xcollazo@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply
[21:29:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:38:33] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply
[21:48:59] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply
[21:50:55] <wikibugs>	 (03CR) 10BCornwall: switch gerrit service IP to CDN (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1215709 (https://phabricator.wikimedia.org/T411895) (owner: 10Dzahn)
[21:52:07] <wikibugs>	 (03CR) 10Dzahn: switch gerrit service IP to CDN (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1215709 (https://phabricator.wikimedia.org/T411895) (owner: 10Dzahn)
[21:52:32] <wikibugs>	 (03PS2) 10Dzahn: switch gerrit service IP to CDN [dns] - 10https://gerrit.wikimedia.org/r/1215709 (https://phabricator.wikimedia.org/T411895)
[21:56:37] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply
[21:57:48] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply
[22:06:39] <wikibugs>	 (03PS1) 10Aleksandar Mastilovic: Revert "Add an extra egress rule for airflow-main to allow uploading to frack s3" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215717
[22:07:07] <wikibugs>	 (03CR) 10Xcollazo: [C:03+1] Revert "Add an extra egress rule for airflow-main to allow uploading to frack s3" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215717 (owner: 10Aleksandar Mastilovic)
[22:07:49] <wikibugs>	 (03CR) 10Aleksandar Mastilovic: [C:03+1] Revert "Add an extra egress rule for airflow-main to allow uploading to frack s3" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215717 (owner: 10Aleksandar Mastilovic)
[22:09:09] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Revert "Add an extra egress rule for airflow-main to allow uploading to frack s3" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215717 (owner: 10Aleksandar Mastilovic)
[22:09:10] <wikibugs>	 (03CR) 10Bking: [V:03+2 C:03+2] Add an extra egress rule for airflow-main to allow uploading to frack s3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215665 (https://phabricator.wikimedia.org/T411740) (owner: 10Btullis)
[22:10:04] <wikibugs>	 (03CR) 10Bking: [V:03+2 C:03+2] Revert "Add an extra egress rule for airflow-main to allow uploading to frack s3" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215717 (owner: 10Aleksandar Mastilovic)
[22:10:56] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply
[22:11:03] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply
[22:11:41] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply
[22:11:59] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply
[22:16:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:29:21] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply
[22:29:27] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply
[22:31:57] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply
[22:32:03] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply
[22:33:43] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply
[22:34:00] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply
[22:34:22] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply
[22:35:06] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply
[22:55:12] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[23:30:12] <jinxer-wm>	 FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[23:47:05] <jinxer-wm>	 FIRING: KubernetesCalicoDown: ml-serve1013.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-mlserve&var-instance=ml-serve1013.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown