[00:15:28] (PS1) Bartosz Dziewoński: Fix displaying events with IP agents [extensions/Echo] (wmf/1.47.0-wmf.7) - https://gerrit.wikimedia.org/r/1304222 (https://phabricator.wikimedia.org/T428198)
[00:17:15] FIRING: [2x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1023:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished
[01:06:34] !log roll restart eventgate-analytics to pick up stream config change - T427787
[01:06:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:06:38] T427787: wdqs: design and declare an event platform schema for WDQS v2 logs. - https://phabricator.wikimedia.org/T427787
[01:12:36] (CR) Dzahn: "😎" [puppet] - https://gerrit.wikimedia.org/r/1304140 (https://phabricator.wikimedia.org/T426995) (owner: Dzahn)
[01:12:58] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1304238
[01:12:58] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1304238 (owner: TrainBranchBot)
[01:17:39] !log otto@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-analytics: sync
[01:17:43] !log otto@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: sync
[01:17:51] !log otto@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-analytics: sync
[01:18:27] !log otto@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: sync
[01:18:53] !log otto@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: sync
[01:19:25] !log otto@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: sync
[01:21:08] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1304238 (owner: TrainBranchBot)
[01:22:53] SRE, SRE-Access-Requests, Data-Engineering, Patch-For-Review: Requesting access to analytics-privatedata-users level 1 for chudson - https://phabricator.wikimedia.org/T429353#12035976 (Ottomata) Approved! But FYI data eng [[ https://github.com/wikimedia/operations-puppet/blob/production/modules/a...
[01:24:39] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:29:39] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:49:39] FIRING: PuppetFailure: Puppet has failed on cumin2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[01:53:38] (PS2) JHathaway: WIP: fix tests? [software/pywmflib] - https://gerrit.wikimedia.org/r/1304180
[01:58:31] (CR) CI reject: [V:-1] WIP: fix tests? [software/pywmflib] - https://gerrit.wikimedia.org/r/1304180 (owner: JHathaway)
[02:00:43] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image
[02:07:33] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 06m 49s)
[02:09:39] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:14:39] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:19:22] (PS4) Krinkle: Disable ShortUrl on remaining wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1303006 (https://phabricator.wikimedia.org/T107188)
[02:19:22] (PS1) Krinkle: Undeploy the ShortUrl extension [mediawiki-config] - https://gerrit.wikimedia.org/r/1304247 (https://phabricator.wikimedia.org/T107188)
[02:25:35] (PS3) Krinkle: [WIP] Disable ShortURL everywhere, without migration [mediawiki-config] - https://gerrit.wikimedia.org/r/1184153 (owner: Jforrester)
[02:25:45] (CR) CI reject: [V:-1] [WIP] Disable ShortURL everywhere, without migration [mediawiki-config] - https://gerrit.wikimedia.org/r/1184153 (owner: Jforrester)
[02:25:56] (Abandoned) Krinkle: [WIP] Disable ShortURL everywhere, without migration [mediawiki-config] - https://gerrit.wikimedia.org/r/1184153 (owner: Jforrester)
[02:26:38] FIRING: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
[02:37:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:54:39] FIRING: [5x] SystemdUnitFailed: cowbuilder_update_bookworm-amd64.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:17:15] FIRING: [2x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1023:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished
[04:24:39] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:29:39] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:55:23] ops-codfw, SRE, DBA, DC-Ops: Degraded RAID on db2247 - https://phabricator.wikimedia.org/T429348#12036090 (Marostegui) All looking good on my end too! Thanks
[04:56:47] (PS1) Marostegui: pc2022: Remove note [puppet] - https://gerrit.wikimedia.org/r/1304270
[05:03:44] (CR) Marostegui: [C:+2] pc2022: Remove note [puppet] - https://gerrit.wikimedia.org/r/1304270 (owner: Marostegui)
[05:27:04] SRE, Data-Engineering, Data-Platform-SRE, serviceops-radar, Event-Platform: Configuration Management for Kafka settings - https://phabricator.wikimedia.org/T276088#12036116 (RKemper) >>! In T276088#12028089, @elukey wrote: > We have a Kafka upgrade Cloud project that we could use with Pontoon...
[05:49:39] FIRING: PuppetFailure: Puppet has failed on cumin2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260619T0600)
[06:02:03] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1304201 (https://phabricator.wikimedia.org/T429353) (owner: Ladsgroup)
[06:24:55] (PS1) Slyngshede: CAS 7.3.7.3 [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1304500
[06:26:38] FIRING: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
[06:29:28] SRE, Infrastructure-Foundations: Migrate diffscan VM to Trixie - https://phabricator.wikimedia.org/T415347#12036186 (ayounsi) First run of the PeeringDB script seems to have worked fine (but it's just the init phase). However diffscan is failing with: ` 2026-06-18T11:59:40.818472+00:00 diffscan03 diffsc...
[06:31:37] (PS1) Muehlenhoff: Fix permissions to /srv/homer/public/definitions after checkout [puppet] - https://gerrit.wikimedia.org/r/1304504 (https://phabricator.wikimedia.org/T427897)
[06:32:00] (CR) Muehlenhoff: [C:+1] "Looks good" [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1304500 (owner: Slyngshede)
[06:32:09] (CR) Slyngshede: [V:+2 C:+2] CAS 7.3.7.3 [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1304500 (owner: Slyngshede)
[06:37:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:38:18] (PS2) Muehlenhoff: Fix permissions to /srv/homer/public/definitions after checkout [puppet] - https://gerrit.wikimedia.org/r/1304504 (https://phabricator.wikimedia.org/T427897)
[06:47:06] (PS1) Jelto: gerrit: increase thresholds for GerritHigh4xxRatio alert [alerts] - https://gerrit.wikimedia.org/r/1304506 (https://phabricator.wikimedia.org/T428979)
[06:50:19] (CR) Elukey: "Re-reading past Janis' comments - it seems that at the time we didn't have Dragonfly, and we relied only on the registry's caching. Things" [puppet] - https://gerrit.wikimedia.org/r/1304060 (https://phabricator.wikimedia.org/T390251) (owner: Elukey)
[06:59:42] (PS1) Elukey: docker_registry: remove support for the nginx blob cache [puppet] - https://gerrit.wikimedia.org/r/1304512 (https://phabricator.wikimedia.org/T427175)
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260619T0700)
[07:00:24] (CR) Elukey: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1304512 (https://phabricator.wikimedia.org/T427175) (owner: Elukey)
[07:29:03] (CR) Muehlenhoff: [C:+1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/1303380 (https://phabricator.wikimedia.org/T372892) (owner: Slyngshede)
[07:29:51] (CR) Slyngshede: [C:+2] IDP: Bump local version, 7.3.7.2+wmf13u2 [dns] - https://gerrit.wikimedia.org/r/1303380 (https://phabricator.wikimedia.org/T372892) (owner: Slyngshede)
[07:30:03] !log slyngshede@dns1004 START - running authdns-update
[07:31:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[07:31:51] !log slyngshede@dns1004 END - running authdns-update
[07:32:21] !log Update IDP/SSO to CAS v7.3.7.3
[07:32:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:37] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1304504 (https://phabricator.wikimedia.org/T427897) (owner: Muehlenhoff)
[07:33:11] (PS1) Kevin Bazira: ml: Rebuild vLLM base images to use deb.debian.org in apt sources [docker-images/production-images] - https://gerrit.wikimedia.org/r/1304514 (https://phabricator.wikimedia.org/T429667)
[07:34:30] (CR) CWilliams: [C:+1] major-upgrade.py: Add !log dbmaint on the start [cookbooks] - https://gerrit.wikimedia.org/r/1303438 (owner: Marostegui)
[07:34:39] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:35:57] (PS1) Muehlenhoff: Fix update config for wmfdb [puppet] - https://gerrit.wikimedia.org/r/1304516 (https://phabricator.wikimedia.org/T427900)
[07:36:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[07:37:38] (CR) Brouberol: [C:+2] data-platform: add alert on kafka-jumbo partition sizes [alerts] - https://gerrit.wikimedia.org/r/1302737 (https://phabricator.wikimedia.org/T429127) (owner: Brouberol)
[07:39:39] FIRING: [5x] SystemdUnitFailed: cowbuilder_update_bookworm-amd64.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:55:30] SRE, Infrastructure-Foundations, netops: cr2-esams rpd failure after enabling bgp 'graceful-shutdown' (June 2026) - https://phabricator.wikimedia.org/T429386#12036327 (cmooney) @Papaul FYI for any upgrades next week let's use 23.4R2-S8, older versions have this nasty bug.
[07:57:29] SRE, Infrastructure-Foundations, netops: cr2-esams rpd failure after enabling bgp 'graceful-shutdown' (June 2026) - https://phabricator.wikimedia.org/T429386#12036329 (cmooney) Also recording JTAC case number is 2026-0616-761841
[07:59:39] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[08:12:39] SRE, Data-Platform-SRE: Redeploy cirrus-streaming-updater/producer and cirrus-streaming-updater/consumer to pick up current mirror - https://phabricator.wikimedia.org/T429671 (MoritzMuehlenhoff) NEW
[08:12:56] SRE, Data-Platform-SRE: Redeploy cirrus-streaming-updater/producer and cirrus-streaming-updater/consumer to pick up current mirror - https://phabricator.wikimedia.org/T429671#12036369 (MoritzMuehlenhoff) p:Triage→High
[08:14:39] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[08:14:41] (PS11) Effie Mouzeli: Add /llms.txt where honest robots can read our API Policy #1 [mediawiki-config] - https://gerrit.wikimedia.org/r/1303454 (https://phabricator.wikimedia.org/T429599)
[08:15:20] (CR) Federico Ceratto: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1304516 (https://phabricator.wikimedia.org/T427900) (owner: Muehlenhoff)
[08:17:15] FIRING: [2x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1023:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished
[08:18:40] (CR) Elukey: "https://gerrit.wikimedia.org/r/c/operations/puppet/+/1304512" [puppet] - https://gerrit.wikimedia.org/r/1304060 (https://phabricator.wikimedia.org/T390251) (owner: Elukey)
[08:18:47] SRE-tools, Infrastructure-Foundations, Packaging: Upgrade prometheus-atlas-exporter - https://phabricator.wikimedia.org/T429672 (ayounsi) NEW p:Triage→Low
[08:23:07] (PS9) Trueg: dse-k8s-services: Enable ingress on WDQS namespaces [deployment-charts] - https://gerrit.wikimedia.org/r/1302784 (https://phabricator.wikimedia.org/T429313)
[08:24:17] (CR) Trueg: dse-k8s-services: Enable ingress on WDQS namespaces (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1302784 (https://phabricator.wikimedia.org/T429313) (owner: Trueg)
[08:26:24] (CR) Btullis: [C:+2] wikibase: fixed bash syntax error [dumps] - https://gerrit.wikimedia.org/r/1304029 (https://phabricator.wikimedia.org/T425036) (owner: Trueg)
[08:33:39] SRE: Redeploy Liftwing pick up current mirror - https://phabricator.wikimedia.org/T429675 (MoritzMuehlenhoff) NEW
[08:37:40] (CR) CWilliams: [C:+2] mariadb: Support argument for mysql-section.sh [puppet] - https://gerrit.wikimedia.org/r/1304065 (https://phabricator.wikimedia.org/T429613) (owner: CWilliams)
[08:40:37] SRE: Redeploy Liftwing pick up current mirror - https://phabricator.wikimedia.org/T429675#12036455 (elukey) @isarantopoulos @achou Hi! Do you have time to rebuild the aforementioned images to pick up the new changes (and deploy them) ?
[08:51:57] (PS12) Ayounsi: diffscan: pyhotnify [puppet] - https://gerrit.wikimedia.org/r/634572 (https://phabricator.wikimedia.org/T415347) (owner: Jbond)
[08:52:34] (PS4) Giuseppe Lavagetto: hiddenparma: switch to native CAS authentication [puppet] - https://gerrit.wikimedia.org/r/1299475 (https://phabricator.wikimedia.org/T422235)
[08:53:00] (CR) Muehlenhoff: [C:+2] Fix update config for wmfdb [puppet] - https://gerrit.wikimedia.org/r/1304516 (https://phabricator.wikimedia.org/T427900) (owner: Muehlenhoff)
[08:54:00] cezmunsta: okay to merge your "mariadb: Support argument for mysql-section-sh" commit alongside?
[08:54:44] moritzm: yep, thanks
[08:59:46] SRE, Infrastructure-Foundations, Patch-For-Review: Migrate diffscan VM to Trixie - https://phabricator.wikimedia.org/T415347#12036490 (ayounsi) Disabled puppet, manually copied that refactored version: https://gerrit.wikimedia.org/r/c/operations/puppet/+/634572/11/modules/diffscan/files/diffscan.py T...
[09:00:17] ack, merged now
[09:00:39] (CR) Ayounsi: [C:+1] Fix permissions to /srv/homer/public/definitions after checkout [puppet] - https://gerrit.wikimedia.org/r/1304504 (https://phabricator.wikimedia.org/T427897) (owner: Muehlenhoff)
[09:04:14] (CR) Santiago Faci: "recheck" [deployment-charts] - https://gerrit.wikimedia.org/r/1303003 (https://phabricator.wikimedia.org/T428986) (owner: Clare Ming)
[09:04:45] SRE: Redeploy Liftwing pick up current mirror - https://phabricator.wikimedia.org/T429675#12036516 (kevinbazira) @elukey we are tackling this in T429667, the goal is to: * Rebuild vLLM base images to use deb.debian.org in apt sources * Update LLM blubberfiles to use the rebuilt vLLM base images
[09:06:19] SRE: Redeploy Liftwing pick up current mirror - https://phabricator.wikimedia.org/T429675#12036524 (MoritzMuehlenhoff) Thanks!
[09:06:23] RESOLVED: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn
[09:08:14] SRE-tools, Infrastructure-Foundations: Diffscan: investigate IPv6 support and explore other scanning tooling - https://phabricator.wikimedia.org/T265329#12036538 (ayounsi) Working on T415347, I'm seeing this in the `diffscan-cloud-infrastructure` instance logs. ` Jun 19 08:57:42 diffscan03 diffscan[86132...
[09:10:00] (PS8) Tiziano Fogli: slothslos/report2drive: add modules [puppet] - https://gerrit.wikimedia.org/r/1298294 (https://phabricator.wikimedia.org/T425795)
[09:10:00] (PS10) Tiziano Fogli: slothslos/report2drive: add profiles [puppet] - https://gerrit.wikimedia.org/r/1298295 (https://phabricator.wikimedia.org/T425795)
[09:10:01] (PS10) Tiziano Fogli: slothslos/report2drive: instantiate resources [puppet] - https://gerrit.wikimedia.org/r/1298296 (https://phabricator.wikimedia.org/T425795)
[09:10:01] (PS10) Tiziano Fogli: slothslos/report2drive: add Hiera configuration [puppet] - https://gerrit.wikimedia.org/r/1298297 (https://phabricator.wikimedia.org/T425795)
[09:10:02] (PS10) Tiziano Fogli: slothslos/report2drive: enable deep merge for vars [puppet] - https://gerrit.wikimedia.org/r/1298298 (https://phabricator.wikimedia.org/T425795)
[09:16:26] (CR) Elukey: "One small change and we are good to go!" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1304514 (https://phabricator.wikimedia.org/T429667) (owner: Kevin Bazira)
[09:21:40] (PS2) Kevin Bazira: ml: Rebuild vLLM base images to use deb.debian.org in apt sources [docker-images/production-images] - https://gerrit.wikimedia.org/r/1304514 (https://phabricator.wikimedia.org/T429667)
[09:22:49] FIRING: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[09:25:54] (CR) Elukey: [C:+1] ml: Rebuild vLLM base images to use deb.debian.org in apt sources (2 comments) [docker-images/production-images] - https://gerrit.wikimedia.org/r/1304514 (https://phabricator.wikimedia.org/T429667) (owner: Kevin Bazira)
[09:31:56] (CR) Elukey: "The only remaining comments are related to the ERB files, then we are good to go!" [puppet] - https://gerrit.wikimedia.org/r/1298294 (https://phabricator.wikimedia.org/T425795) (owner: Tiziano Fogli)
[09:33:04] (CR) Elukey: [V:+2 C:+2] ml: Rebuild vLLM base images to use deb.debian.org in apt sources [docker-images/production-images] - https://gerrit.wikimedia.org/r/1304514 (https://phabricator.wikimedia.org/T429667) (owner: Kevin Bazira)
[09:33:54] (CR) Tiziano Fogli: "The script’s error handling is designed to continue report generation and uploads as much as possible, given that it is a rather slow proc" [puppet] - https://gerrit.wikimedia.org/r/1298294 (https://phabricator.wikimedia.org/T425795) (owner: Tiziano Fogli)
[09:34:41] SRE, DBA, observability, SRE Observability: Alerts showing "AlertLintProblem" - MySQLReplicaNotUsingGTID - https://phabricator.wikimedia.org/T427469#12036590 (Marostegui) Open→Resolved There have been no more alerts with this so closing - we can reopen if it happens again. Thanks for...
[09:36:39] (CR) Elukey: [C:+1] slothslos/report2drive: add modules (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1298294 (https://phabricator.wikimedia.org/T425795) (owner: Tiziano Fogli)
[09:36:42] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1020.eqiad.wmnet
[09:43:21] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1020.eqiad.wmnet
[09:47:49] RESOLVED: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[09:49:39] FIRING: PuppetFailure: Puppet has failed on cumin2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[09:54:19] FIRING: [2x] HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state pending-rollback - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[09:54:38] !log cmooney@cumin1003 START - Cookbook sre.network.host-bgp for host dse-k8s-worker1020
[09:56:06] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.host-bgp (exit_code=0) for host dse-k8s-worker1020
[09:56:32] sre-alert-triage, Wikifunctions: Alert in need of triage: ProbeDown (instance wikifunctions-python-evaluator-staging:30443) - https://phabricator.wikimedia.org/T429678 (LSobanski) NEW
[09:57:51] !log btullis@cumin1003 START - Cookbook sre.network.host-bgp for host dse-k8s-worker1022
[09:58:14] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.host-bgp (exit_code=0) for host dse-k8s-worker1022
[09:59:09] !log cmooney@cumin1003 START - Cookbook sre.network.host-bgp for host dse-k8s-worker1024
[10:00:01] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.host-bgp (exit_code=0) for host dse-k8s-worker1024
[10:03:06] !log cmooney@cumin1003 START - Cookbook sre.network.host-bgp for host dse-k8s-worker1021
[10:03:16] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.host-bgp (exit_code=0) for host dse-k8s-worker1021
[10:03:21] !log cmooney@cumin1003 START - Cookbook sre.network.host-bgp for host dse-k8s-worker1023
[10:03:42] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.host-bgp (exit_code=0) for host dse-k8s-worker1023
[10:03:47] !log cmooney@cumin1003 START - Cookbook sre.network.host-bgp for host dse-k8s-worker1024
[10:04:35] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.host-bgp (exit_code=0) for host dse-k8s-worker1024
[10:09:49] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1020.eqiad.wmnet
[10:14:02] (CR) Tiziano Fogli: [C:+1] slothslos/report2drive: add modules (3 comments) [puppet] - https://gerrit.wikimedia.org/r/1298294 (https://phabricator.wikimedia.org/T425795) (owner: Tiziano Fogli)
[10:16:41] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1020.eqiad.wmnet
[10:20:35] (CR) Muehlenhoff: [C:-1] Fix permissions to /srv/homer/public/definitions after checkout (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1304504 (https://phabricator.wikimedia.org/T427897) (owner: Muehlenhoff)
[10:23:56] (CR) Elukey: sre.hosts.provision: introduce the wmfroot user (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/1291994 (https://phabricator.wikimedia.org/T426180) (owner: Elukey)
[10:24:39] RESOLVED: PuppetFailure: Puppet has failed on cumin2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[10:25:55] (PS1) Muehlenhoff: homer: Don't explicitly set a file mode for the srv/homer checkout [puppet] - https://gerrit.wikimedia.org/r/1304532 (https://phabricator.wikimedia.org/T427897)
[10:29:12] !log Run `MigrateMentorStatusAway` script for all wikis in growthexperiments dblist - T409170
[10:29:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:16] T409170: Run MigrateMentorStatusAway migration script - https://phabricator.wikimedia.org/T409170
[10:33:28] !log btullis@puppetserver1001 conftool action : set/pooled=no; selector: service=kubesvc,cluster=dse-k8s,dc=codfw,name=dse-k8s-wdqs2001.codfw.wmnet
[10:33:33] !log btullis@puppetserver1001 conftool action : set/pooled=no; selector: service=kubesvc,cluster=dse-k8s,dc=codfw,name=dse-k8s-wdqs2002.codfw.wmnet
[10:33:37] !log btullis@puppetserver1001 conftool action : set/pooled=no; selector: service=kubesvc,cluster=dse-k8s,dc=codfw,name=dse-k8s-wdqs2003.codfw.wmnet
[10:33:40] !log btullis@puppetserver1001 conftool action : set/pooled=no; selector: service=kubesvc,cluster=dse-k8s,dc=codfw,name=dse-k8s-wdqs2004.codfw.wmnet
[10:37:14] !log imported nodejs 22.23.0-1nodesource1 to thirdparty/node22 for trixie-wikimedia
[10:37:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:37:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:38:00] !log imported nodejs 24.17.0-1nodesource1 to thirdparty/node24 for trixie-wikimedia
[10:38:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:43:25] (PS1) Muehlenhoff: nodejs-22/nodejs-24: Bump versions to latest Node security releases [docker-images/production-images] - https://gerrit.wikimedia.org/r/1304533
[10:52:52] (CR) Santiago Faci: Add phabricator api token for Test Kitchen (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1303003 (https://phabricator.wikimedia.org/T428986) (owner: Clare Ming)
[10:53:06] (CR) Santiago Faci: Add phabricator api token for Test Kitchen (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1303003 (https://phabricator.wikimedia.org/T428986) (owner: Clare Ming)
[10:54:38] (CR) Ayounsi: [C:+1] homer: Don't explicitly set a file mode for the srv/homer checkout [puppet] - https://gerrit.wikimedia.org/r/1304532 (https://phabricator.wikimedia.org/T427897) (owner: Muehlenhoff)
[11:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260619T0700)
[11:00:05] jelto, arnoldokoth, mutante, and arnaudb: It is that lovely time of the day again! You are hereby commanded to deploy GitLab version upgrades. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260619T1100).
[11:10:02] (PS1) Kamila Součková: services/*: Fix early inclusion of clusterinfo values [deployment-charts] - https://gerrit.wikimedia.org/r/1304545 (https://phabricator.wikimedia.org/T388390)
[11:12:30] (CR) Kamila Součková: "not sure how this happened, sorry" [deployment-charts] - https://gerrit.wikimedia.org/r/1304545 (https://phabricator.wikimedia.org/T388390) (owner: Kamila Součková)
[11:14:31] (CR) Clément Goubert: [C:+1] services/*: Fix early inclusion of clusterinfo values [deployment-charts] - https://gerrit.wikimedia.org/r/1304545 (https://phabricator.wikimedia.org/T388390) (owner: Kamila Součková)
[11:21:21] (CR) Kamila Součková: [C:+2] services/*: Fix early inclusion of clusterinfo values [deployment-charts] - https://gerrit.wikimedia.org/r/1304545 (https://phabricator.wikimedia.org/T388390) (owner: Kamila Součková)
[11:23:57] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[11:32:13] (PS2) Muehlenhoff: homer: Don't explicitly set a file mode for the srv/homer checkout [puppet] - https://gerrit.wikimedia.org/r/1304532 (https://phabricator.wikimedia.org/T427897)
[11:32:59] (Merged) jenkins-bot: services/*: Fix early inclusion of clusterinfo values [deployment-charts] - https://gerrit.wikimedia.org/r/1304545 (https://phabricator.wikimedia.org/T388390) (owner: Kamila Součková)
[11:33:35] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[11:33:42] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1304532 (https://phabricator.wikimedia.org/T427897) (owner: Muehlenhoff)
[11:39:39] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:42:16] (CR) Muehlenhoff: [C:+2] homer: Don't explicitly set a file mode for the srv/homer checkout [puppet] - https://gerrit.wikimedia.org/r/1304532 (https://phabricator.wikimedia.org/T427897) (owner: Muehlenhoff)
[11:44:19] RESOLVED: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state pending-rollback - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[11:49:38] (PS1) Ayounsi: netbox: add a BGP getter/setter [software/spicerack] - https://gerrit.wikimedia.org/r/1304554
[11:50:02] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-drmrs:et-0/0/0 (Transport: Arelion (IC-398708) {#20260601}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[11:52:32] (CR) Elukey: nodejs-22/nodejs-24: Bump versions to latest Node security releases (2 comments) [docker-images/production-images] - https://gerrit.wikimedia.org/r/1304533 (owner: Muehlenhoff)
[11:54:26] (CR) CI reject: [V:-1] netbox: add a BGP getter/setter [software/spicerack] - https://gerrit.wikimedia.org/r/1304554 (owner: Ayounsi)
[11:55:24] SRE, Infrastructure-Foundations, Patch-For-Review: Upgrade Cumin hosts to Trixie - https://phabricator.wikimedia.org/T427897#12036771 (MoritzMuehlenhoff) https://gerrit.wikimedia.org/r/c/operations/puppet/+/1304532 fixed the permission issues we saw when running homer as a normal user. There's howeve...
[11:55:56] (Abandoned) Muehlenhoff: Fix permissions to /srv/homer/public/definitions after checkout [puppet] - https://gerrit.wikimedia.org/r/1304504 (https://phabricator.wikimedia.org/T427897) (owner: Muehlenhoff)
[11:56:34] FIRING: [2x] HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[11:58:52] (CR) Muehlenhoff: nodejs-22/nodejs-24: Bump versions to latest Node security releases (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/1304533 (owner: Muehlenhoff)
[12:01:10] (CR) Ayounsi: Cookbook to enable BGP for a given host and configure network (5 comments) [cookbooks] - https://gerrit.wikimedia.org/r/1304137 (https://phabricator.wikimedia.org/T429488) (owner: Cathal Mooney)
[12:02:32] (CR) Elukey: nodejs-22/nodejs-24: Bump versions to latest Node security releases (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/1304533 (owner: Muehlenhoff)
[12:04:00] (PS2) Muehlenhoff: nodejs-22/nodejs-24: Bump versions to latest Node security releases [docker-images/production-images] - https://gerrit.wikimedia.org/r/1304533
[12:04:23] (CR) Muehlenhoff: nodejs-22/nodejs-24: Bump versions to latest Node security releases (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/1304533 (owner: Muehlenhoff)
[12:04:27] !log urbanecm@deploy1003 mwscript-k8s job started: GrowthExperiments:MigrateMentorStatusAway --wiki=viwiki # T409170
[12:04:32] T409170: Run MigrateMentorStatusAway migration script - https://phabricator.wikimedia.org/T409170
[12:05:13] !log urbanecm@deploy1003 mwscript-k8s job started: GrowthExperiments:migrateMentorStatusAway.php --wiki=viwiki # T409170
[12:08:37] !log aokoth@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on phab2002.codfw.wmnet with reason: Host Replacement
[12:10:06] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[12:14:39] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:14:47] (PS1) Muehlenhoff: homer: Don't explicitly set a file mode for the private homer checkout [puppet] - https://gerrit.wikimedia.org/r/1304558 (https://phabricator.wikimedia.org/T427897)
[12:15:20] (CR) CI reject: [V:-1] homer: Don't explicitly set a file mode for the private homer checkout [puppet] - https://gerrit.wikimedia.org/r/1304558 (https://phabricator.wikimedia.org/T427897) (owner: Muehlenhoff)
[12:16:32] (PS2) Muehlenhoff: homer: Don't explicitly set a file mode for the private homer checkout [puppet] - https://gerrit.wikimedia.org/r/1304558 (https://phabricator.wikimedia.org/T427897)
[12:17:15] FIRING: [2x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1023:0 has a BGP session which is not in the 'established' state.
- https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [12:18:10] (03CR) 10Joal: "Let's ask @btullis@wikimedia.org what he thinks, but IMO we should extract the "waiting" algorithm to a dedicated file instead of copying " [puppet] - 10https://gerrit.wikimedia.org/r/1303460 (https://phabricator.wikimedia.org/T425385) (owner: 10Snwachukwu) [12:19:15] (03CR) 10Jforrester: [C:03+1] nodejs-22/nodejs-24: Bump versions to latest Node security releases [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1304533 (owner: 10Muehlenhoff) [12:21:03] !log fceratto@cumin1003 START - Cookbook sre.hosts.remove-downtime for db2160.codfw.wmnet [12:21:03] !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2160.codfw.wmnet [12:21:12] !log fceratto@cumin1003 START - Cookbook sre.hosts.remove-downtime for db2232.codfw.wmnet [12:21:13] !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2232.codfw.wmnet [12:21:18] !log fceratto@cumin1003 START - Cookbook sre.hosts.remove-downtime for db2234.codfw.wmnet [12:21:19] !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2234.codfw.wmnet [12:21:26] !log fceratto@cumin1003 START - Cookbook sre.hosts.remove-downtime for db2235.codfw.wmnet [12:21:26] !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2235.codfw.wmnet [12:21:35] !log fceratto@cumin1003 START - Cookbook sre.hosts.remove-downtime for db2235.codfw.wmnet [12:21:35] !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2235.codfw.wmnet [12:23:02] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Create cookbook to add BGP peering for host by triggering Homer run on correct device - https://phabricator.wikimedia.org/T429488#12036813 
(10ayounsi) On the scope creep list we could add a validator to prevent turning on the BGP flag if th... [12:24:36] (03PS1) 10Jforrester: ExecuteTestAndCacheJob: Don't explode when there are no connected Implementations/Tests [extensions/WikiLambda] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1304563 (https://phabricator.wikimedia.org/T429460) [12:25:24] (03CR) 10Jforrester: [C:03+1] Add /llms-rate-limits.txt and /llms-content-reuse.txt #2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1304168 (https://phabricator.wikimedia.org/T429599) (owner: 10Effie Mouzeli) [12:25:28] (03CR) 10Jforrester: [C:03+1] Add /llms.txt where honest robots can read our API Policy #1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1303454 (https://phabricator.wikimedia.org/T429599) (owner: 10Effie Mouzeli) [12:25:48] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] nodejs-22/nodejs-24: Bump versions to latest Node security releases [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1304533 (owner: 10Muehlenhoff) [12:27:30] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1304558 (https://phabricator.wikimedia.org/T427897) (owner: 10Muehlenhoff) [12:28:39] (03CR) 10Marostegui: [C:03+2] major-upgrade.py: Add !log dbmaint on the start [cookbooks] - 10https://gerrit.wikimedia.org/r/1303438 (owner: 10Marostegui) [12:32:25] (03PS13) 10Ayounsi: diffscan: pyhotnify [puppet] - 10https://gerrit.wikimedia.org/r/634572 (https://phabricator.wikimedia.org/T415347) (owner: 10Jbond) [12:32:52] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1022.eqiad.wmnet [12:36:40] 10SRE-tools, 06Infrastructure-Foundations, 10Packaging: Upgrade prometheus-atlas-exporter - https://phabricator.wikimedia.org/T429672#12036853 (10ayounsi) [12:39:18] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1022.eqiad.wmnet [12:39:55] (03CR) 10Ayounsi: [C:03+1] homer: 
Don't explicitly set a file mode for the private homer checkout [puppet] - 10https://gerrit.wikimedia.org/r/1304558 (https://phabricator.wikimedia.org/T427897) (owner: 10Muehlenhoff) [12:40:32] (03CR) 10Muehlenhoff: [C:03+2] homer: Don't explicitly set a file mode for the private homer checkout [puppet] - 10https://gerrit.wikimedia.org/r/1304558 (https://phabricator.wikimedia.org/T427897) (owner: 10Muehlenhoff) [12:40:47] (03PS1) 10DCausse: ttmserver-export: pass source language for translation batch IDs [extensions/Translate] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1304565 (https://phabricator.wikimedia.org/T429479) [12:41:09] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 22 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/Translate] (wmf/1.47.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1304565 (https://phabricator.wikimedia.org/T429479) (owner: 10DCausse) [12:46:16] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-wdqs1001.eqiad.wmnet [12:51:50] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-wdqs1001.eqiad.wmnet [12:51:53] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-wdqs1002.eqiad.wmnet [12:55:53] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-wdqs1002.eqiad.wmnet [12:55:56] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-wdqs1003.eqiad.wmnet [12:57:59] (03PS1) 10Muehlenhoff: Allow cumin2003 in alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/1304571 (https://phabricator.wikimedia.org/T427897) [12:58:52] (03CR) 10Elukey: [C:03+1] Allow cumin2003 in alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/1304571 (https://phabricator.wikimedia.org/T427897) (owner: 10Muehlenhoff) [12:59:52] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Upgrade 
Cumin hosts to Trixie - https://phabricator.wikimedia.org/T427897#12036893 (10MoritzMuehlenhoff) >>! In T427897#12036771, @MoritzMuehlenhoff wrote: > https://gerrit.wikimedia.org/r/c/operations/puppet/+/1304532 fixed the permission issues we sa... [13:01:10] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-wdqs1003.eqiad.wmnet [13:02:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:02:32] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-wdqs2001.codfw.wmnet [13:02:58] 06SRE, 10SRE-tools, 06Infrastructure-Foundations: Traceback with IRC notifications on Trixie - https://phabricator.wikimedia.org/T429681 (10MoritzMuehlenhoff) 03NEW [13:08:09] (03CR) 10Muehlenhoff: [C:03+2] Allow cumin2003 in alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/1304571 (https://phabricator.wikimedia.org/T427897) (owner: 10Muehlenhoff) [13:08:59] (03CR) 10Blake: main: Add a namespace for the mw-pretrain service. (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1304083 (https://phabricator.wikimedia.org/T427668) (owner: 10Blake) [13:09:08] (03PS2) 10Blake: main: Add a namespace for the mw-pretrain service. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1304083 (https://phabricator.wikimedia.org/T427668) [13:12:23] (03PS1) 10Kamila Součková: aux-k8s-services/*: Fix early inclusion of clusterinfo values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1304572 (https://phabricator.wikimedia.org/T388390) [13:12:26] (03PS1) 10Kamila Součková: dse-k8s-services/*: Fix early inclusion of clusterinfo values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1304573 (https://phabricator.wikimedia.org/T388390) [13:14:25] (03CR) 10Snwachukwu: "Actually that sounds better. Let me wait for Ben's response." [puppet] - 10https://gerrit.wikimedia.org/r/1303460 (https://phabricator.wikimedia.org/T425385) (owner: 10Snwachukwu) [13:15:50] RECOVERY - Postfix SMTP on crm2001 is OK: OK - Certificate crm2001.codfw.wmnet will expire on Fri 17 Jul 2026 12:40:00 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting [13:19:31] (03CR) 10Kamila Součková: "NOT sorry 🚫🍺" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1304572 (https://phabricator.wikimedia.org/T388390) (owner: 10Kamila Součková) [13:19:55] 06SRE, 06Data-Persistence, 06DBA: Build wmfdb-admin for Trixie - https://phabricator.wikimedia.org/T427900#12036982 (10FCeratto-WMF) 05Open→03Resolved All done. 
[13:26:35] jmm@cumin2002 reimage (PID 291384) is awaiting input [13:27:28] (03CR) 10Clare Ming: Add phabricator api token for Test Kitchen (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303003 (https://phabricator.wikimedia.org/T428986) (owner: 10Clare Ming) [13:28:29] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2009.codfw.wmnet with OS trixie [13:29:01] (03PS8) 10Clare Ming: Add phabricator api token for Test Kitchen [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303003 (https://phabricator.wikimedia.org/T428986) [13:31:46] (03CR) 10Elukey: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1303529 (owner: 10JHathaway) [13:41:10] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2009.codfw.wmnet with reason: host reimage [13:44:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2009.codfw.wmnet with reason: host reimage [13:46:34] RESOLVED: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:47:40] (03PS2) 10Kamila Součková: dse-k8s-services/*: Fix early inclusion of clusterinfo values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1304573 (https://phabricator.wikimedia.org/T388390) [13:52:36] (03CR) 10Santiago Faci: Add phabricator api token for Test Kitchen (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303003 (https://phabricator.wikimedia.org/T428986) (owner: 10Clare Ming) [13:54:12] (03CR) 10Santiago Faci: Add phabricator api token for Test Kitchen (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303003 (https://phabricator.wikimedia.org/T428986) (owner: 10Clare Ming) 
[13:55:27] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-wmde-users for Seanleong-WMDE - https://phabricator.wikimedia.org/T429474#12037102 (10seanleong-WMDE) @Ladsgroup, yes, I can do that! Is it possible to do this after this ticket? I currently need to access the stats logs to check some script runs, b... [14:00:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2009.codfw.wmnet with OS trixie [14:04:55] (03CR) 10Santiago Faci: [C:03+2] Add phabricator api token for Test Kitchen (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303003 (https://phabricator.wikimedia.org/T428986) (owner: 10Clare Ming) [14:07:14] (03Merged) 10jenkins-bot: Add phabricator api token for Test Kitchen [deployment-charts] - 10https://gerrit.wikimedia.org/r/1303003 (https://phabricator.wikimedia.org/T428986) (owner: 10Clare Ming) [14:08:19] (03PS1) 10Kamila Součková: aux/zarcillo: don't hardcode helmBinary [deployment-charts] - 10https://gerrit.wikimedia.org/r/1304588 (https://phabricator.wikimedia.org/T388390) [14:20:17] PROBLEM - Host dse-k8s-wdqs2001 is DOWN: PING CRITICAL - Packet loss = 100% [14:21:07] RECOVERY - Host dse-k8s-wdqs2001 is UP: PING OK - Packet loss = 0%, RTA = 31.62 ms [14:26:00] (03PS10) 10Federico Ceratto: sre.mysql: split pool/depool [cookbooks] - 10https://gerrit.wikimedia.org/r/1295480 (https://phabricator.wikimedia.org/T422361) [14:26:45] 06SRE, 10SRE-swift-storage, 07Essential-Work: Migrate production swift clusters to trixie - https://phabricator.wikimedia.org/T429630#12037221 (10MatthewVernon) [14:26:52] (03PS3) 10Ladsgroup: admin: Add chudson to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1304201 (https://phabricator.wikimedia.org/T429353) [14:27:08] (03CR) 10Ladsgroup: [V:03+2 C:03+2] admin: Add chudson to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1304201 (https://phabricator.wikimedia.org/T429353) (owner: 10Ladsgroup) 
[14:27:38] (03CR) 10Federico Ceratto: "Rebased and ready for review." [cookbooks] - 10https://gerrit.wikimedia.org/r/1295480 (https://phabricator.wikimedia.org/T422361) (owner: 10Federico Ceratto) [14:29:39] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users level 1 for chudson - https://phabricator.wikimedia.org/T429353#12037227 (10Ladsgroup) 05In progress→03Resolved a:03Ladsgroup [14:29:48] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users level 1 for chudson - https://phabricator.wikimedia.org/T429353#12037229 (10Ladsgroup) [14:45:38] (03PS2) 10Elukey: docker_registry: remove support for the nginx blob cache [puppet] - 10https://gerrit.wikimedia.org/r/1304512 (https://phabricator.wikimedia.org/T427175) [14:45:38] (03PS1) 10Elukey: docker_registry: add migration block for ML images in nginx [puppet] - 10https://gerrit.wikimedia.org/r/1304596 (https://phabricator.wikimedia.org/T428022) [14:46:38] (03CR) 10Elukey: "Hey folks! 
The change needs to be tested $somewhere, maybe with Pontoon, but let me know if you see problems or if you have concerns with " [puppet] - 10https://gerrit.wikimedia.org/r/1304596 (https://phabricator.wikimedia.org/T428022) (owner: 10Elukey) [14:47:36] (03CR) 10Elukey: "I have to say that removing the cache blocks simplify a lot my task of migrating every docker image to S3, see https://gerrit.wikimedia.or" [puppet] - 10https://gerrit.wikimedia.org/r/1304060 (https://phabricator.wikimedia.org/T390251) (owner: 10Elukey) [14:47:49] FIRING: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [14:49:41] (03PS5) 10Daniel Kinzler: rest-gateway: emit 401 if rate limit is 0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298031 (https://phabricator.wikimedia.org/T428184) [14:49:56] (03CR) 10Daniel Kinzler: rest-gateway: emit 401 if rate limit is 0 (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1298031 (https://phabricator.wikimedia.org/T428184) (owner: 10Daniel Kinzler) [14:56:21] (03PS3) 10JHathaway: WIP: fix tests? [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1304180 [15:01:41] (03CR) 10CI reject: [V:04-1] WIP: fix tests? 
[software/pywmflib] - 10https://gerrit.wikimedia.org/r/1304180 (owner: 10JHathaway) [15:03:23] FIRING: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [15:04:23] 06SRE, 06Data-Platform-SRE: Redeploy cirrus-streaming-updater/producer and cirrus-streaming-updater/consumer to pick up current mirror - https://phabricator.wikimedia.org/T429671#12037375 (10dcausse) Should be done soon as part of T426839 when these images will be rebuilt and deployed using `docker-registry.wi... [15:04:23] (03PS4) 10JHathaway: WIP: fix tests? [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1304180 [15:04:36] 06SRE, 06Data-Platform-SRE: Redeploy cirrus-streaming-updater/producer and cirrus-streaming-updater/consumer to pick up current mirror - https://phabricator.wikimedia.org/T429671#12037379 (10dcausse) [15:05:55] (03CR) 10Federico Ceratto: "Other services in aux-k8s are not currently using this pattern according to" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1304588 (https://phabricator.wikimedia.org/T388390) (owner: 10Kamila Součková) [15:09:33] (03CR) 10CI reject: [V:04-1] WIP: fix tests? [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1304180 (owner: 10JHathaway) [15:11:45] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-wdqs2001.codfw.wmnet [15:11:48] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-wdqs2002.codfw.wmnet [15:12:46] (03CR) 10Btullis: "Yes, I mean this seems like a workable approach to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/1303460 (https://phabricator.wikimedia.org/T425385) (owner: 10Snwachukwu) [15:17:04] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-wdqs2002.codfw.wmnet [15:17:07] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-wdqs2003.codfw.wmnet [15:18:07] (03PS5) 10JHathaway: WIP: fix tests? [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1304180 [15:22:25] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-wdqs2003.codfw.wmnet [15:22:29] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-wdqs2004.codfw.wmnet [15:23:18] (03CR) 10CI reject: [V:04-1] WIP: fix tests? [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1304180 (owner: 10JHathaway) [15:27:46] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-wdqs2004.codfw.wmnet [15:34:18] !log btullis@puppetserver1001 conftool action : set/pooled=yes; selector: service=kubesvc,cluster=dse-k8s,dc=codfw,name=dse-k8s-wdqs2001.codfw.wmnet [15:37:44] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1020.eqiad.wmnet [15:38:21] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.14 point update - https://phabricator.wikimedia.org/T426759#12037452 (10MoritzMuehlenhoff) [15:41:04] 06SRE: New Scroll Request for: Duck (Service) - https://phabricator.wikimedia.org/T426847#12037460 (10Aklapper) [15:44:12] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1020.eqiad.wmnet [15:44:16] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1021.eqiad.wmnet [15:45:44] !log btullis@puppetserver1001 conftool action : set/pooled=yes; selector: service=kubesvc,cluster=dse-k8s,dc=codfw,name=dse-k8s-wdqs2002.codfw.wmnet [15:46:13] 06SRE, 10SRE-tools, 
06Infrastructure-Foundations: Traceback with IRC notifications on Trixie - https://phabricator.wikimedia.org/T429681#12037466 (10elukey) @MoritzMuehlenhoff the spicerack config mentions icinga.wikimedia.org:9200 as host:port combination and they don't seem reachable from cumin2003. [15:48:21] (03PS1) 10Elukey: profile::tcpircbot: allow cumin2003 [puppet] - 10https://gerrit.wikimedia.org/r/1304610 (https://phabricator.wikimedia.org/T429681) [15:48:33] moritzm: --^ [15:52:50] ah, is there another place besides modules/profile/manifests/tcpircbot.pp... [15:53:03] (03CR) 10Muehlenhoff: [C:03+1] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/1304610 (https://phabricator.wikimedia.org/T429681) (owner: 10Elukey) [15:54:39] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-drmrs:et-0/0/0 (Transport: Arelion (IC-398708) {#20260601}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:55:32] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1021.eqiad.wmnet [15:55:35] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1022.eqiad.wmnet [16:01:46] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1022.eqiad.wmnet [16:01:49] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1023.eqiad.wmnet [16:04:21] (03PS6) 10JHathaway: WIP: fix tests? 
[software/pywmflib] - 10https://gerrit.wikimedia.org/r/1304180 [16:08:37] !log btullis@puppetserver1001 conftool action : set/pooled=no; selector: service=kubesvc,cluster=dse-k8s,dc=codfw,name=dse-k8s-wdqs2002.codfw.wmnet [16:08:44] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1023.eqiad.wmnet [16:09:39] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:13:17] (03CR) 10Elukey: [C:03+2] profile::tcpircbot: allow cumin2003 [puppet] - 10https://gerrit.wikimedia.org/r/1304610 (https://phabricator.wikimedia.org/T429681) (owner: 10Elukey) [16:14:39] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:17:15] FIRING: [2x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1023:0 has a BGP session which is not in the 'established' state. 
- https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [16:17:26] (03PS1) 10JHathaway: load_ini_config: fix typing of config_file [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1304617 [16:19:39] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:19:58] (03PS1) 10JHathaway: notify_logger: fix tests [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1304618 [16:21:36] (03PS1) 10JHathaway: durable: fix test when run in a tmux [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1304619 [16:22:14] !log btullis@puppetserver1001 conftool action : set/pooled=no; selector: service=kubesvc,cluster=dse-k8s,dc=codfw,name=dse-k8s-wdqs2001.codfw.wmnet [16:22:16] (03CR) 10CI reject: [V:04-1] load_ini_config: fix typing of config_file [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1304617 (owner: 10JHathaway) [16:24:49] (03PS2) 10JHathaway: log: fix tests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1304183 [16:24:50] (03CR) 10CI reject: [V:04-1] notify_logger: fix tests [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1304618 (owner: 10JHathaway) [16:26:34] (03CR) 10CI reject: [V:04-1] durable: fix test when run in a tmux [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1304619 (owner: 10JHathaway) [16:36:44] (03CR) 10JHathaway: sre.hosts.provision: introduce the wmfroot user (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1291994 (https://phabricator.wikimedia.org/T426180) (owner: 10Elukey) [16:37:49] RESOLVED: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state pending-install - 
https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [16:43:23] RESOLVED: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [16:48:38] FIRING: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [17:59:17] (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1303006 (https://phabricator.wikimedia.org/T107188) (owner: 10Krinkle) [18:00:42] (03Merged) 10jenkins-bot: Disable ShortUrl on remaining wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1303006 (https://phabricator.wikimedia.org/T107188) (owner: 10Krinkle) [18:01:25] !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1303006|Disable ShortUrl on remaining wikis (T107188)]] [18:01:29] T107188: Sunset ShortUrl extension in favour of UrlShortener extension - https://phabricator.wikimedia.org/T107188 [18:02:11] (03PS1) 10Sohom Datta: Add source tab to ukwikisource's Архів" (Archive) namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1304630 (https://phabricator.wikimedia.org/T53980) [18:03:36] !log krinkle@deploy1003 krinkle: Backport for [[gerrit:1303006|Disable ShortUrl on remaining wikis (T107188)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[18:12:48] 06SRE, 10dev-images, 06Infrastructure-Foundations, 06Release-Engineering-Team (Priority Backlog 📥): Rebuild dev-images using a base image without mirrors.wikimedia.org in the apt sources - https://phabricator.wikimedia.org/T423972#12037726 (10SomeRandomDeveloper) [18:57:05] (03CR) 10TK-999: "You might want to get the latest `main` to get my fixes: https://github.com/facebook/mcrouter/commit/7e66d4e8a8a47cdcde60bdc776ff65888cf6f" [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/1281942 (https://phabricator.wikimedia.org/T425255) (owner: 10Effie Mouzeli) [19:17:21] !log krinkle@deploy1003 krinkle: Continuing with deployment [19:21:39] !log krinkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1303006|Disable ShortUrl on remaining wikis (T107188)]] (duration: 80m 14s) [19:21:44] T107188: Sunset ShortUrl extension in favour of UrlShortener extension - https://phabricator.wikimedia.org/T107188 [19:22:50] (03Abandoned) 10Arlolra: Configure $wgTrackPreExpansion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1303525 (https://phabricator.wikimedia.org/T353697) (owner: 10Arlolra) [19:33:53] RESOLVED: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [19:34:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 22.76% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:39:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 22.45% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:39:45] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.48% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:44:30] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 22.45% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:44:39] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-drmrs:et-0/0/0 (Transport: Arelion (IC-398708) {#20260601}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [19:44:53] FIRING: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [19:49:53] RESOLVED: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [20:06:53] FIRING: SLOBudgetBurn: Standalone event system success rate is below 99.9% target - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [20:17:15] FIRING: [2x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1023:0 has a BGP session which is not in the 'established' state. 
- https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [22:36:27] (03PS2) 10Sohom Datta: Add source tab to ukwikisource's "Архів" (Archive) namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1304630 (https://phabricator.wikimedia.org/T53980) [22:37:11] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 22 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1304630 (https://phabricator.wikimedia.org/T53980) (owner: 10Sohom Datta) [22:56:17] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-web_443: Servers titan1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:57:17] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:42:46] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1304642 [23:42:46] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1304642 (owner: 10TrainBranchBot) [23:44:39] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-drmrs:et-0/0/0 (Transport: Arelion (IC-398708) {#20260601}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [23:50:31] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1304642 (owner: 10TrainBranchBot)