[00:01:28] !log herron@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafkamon1003.eqiad.wmnet
[00:02:46] !log herron@cumin1003 START - Cookbook sre.hosts.reboot-single for host kafkamon2003.codfw.wmnet
[00:06:44] !log herron@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafkamon2003.codfw.wmnet
[00:13:37] (CR) Dzahn: [C:+2] "works now. noop on both sides and puppet resources are managed on both sides. in the filesystem there are timers but no services on the so" [puppet] - https://gerrit.wikimedia.org/r/1254331 (https://phabricator.wikimedia.org/T420246) (owner: Dzahn)
[00:20:17] (CR) Dzahn: [V:+1 C:+2] "https://puppet-compiler.wmflabs.org/output/1251205/8301/" [puppet] - https://gerrit.wikimedia.org/r/1251205 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn)
[00:23:57] (CR) Dzahn: "This will be the actual switch from old to new jenkins now." [puppet] - https://gerrit.wikimedia.org/r/1254308 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn)
[00:25:29] (CR) Dzahn: [V:+1 C:+2] jenkins: define contint1003 as the manager_host for the jenkins role [puppet] - https://gerrit.wikimedia.org/r/1254295 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn)
[00:45:27] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 19 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1251200 (https://phabricator.wikimedia.org/T419312) (owner: Gerrit Patch Uploader)
[00:50:00] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[00:50:43] I have rescheduled T419312's deployment for 1:00 to 2:00 AM where I live (CDT in Texas), 7:00 to 8:00 AM in UTC
[00:50:44] T419312: Addition of AbuseFilter blocking for the Portuguese Wikipedia - https://phabricator.wikimedia.org/T419312
[00:51:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.43% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[00:54:26] (PS1) Dzahn: jenkins: allow rsyncing of data for migrating a jenkins server [puppet] - https://gerrit.wikimedia.org/r/1255136 (https://phabricator.wikimedia.org/T418521)
[00:54:56] (CR) CI reject: [V:-1] jenkins: allow rsyncing of data for migrating a jenkins server [puppet] - https://gerrit.wikimedia.org/r/1255136 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn)
[00:55:45] (PS1) Dzahn: jenkins: remove httpd profile from role [puppet] - https://gerrit.wikimedia.org/r/1255139
[00:56:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.36% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:00:37] (PS1) Dzahn: ci::jenkins: add firewall rule to allow legacy machines to new jenkins [puppet] - https://gerrit.wikimedia.org/r/1255144 (https://phabricator.wikimedia.org/T418521)
[01:01:16] (CR) CI reject: [V:-1] ci::jenkins: add firewall rule to allow legacy machines to new jenkins [puppet] - https://gerrit.wikimedia.org/r/1255144 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn)
[01:05:00] (PS1) Dzahn: ci::firewall: stop using IPs instead of host names [puppet] - https://gerrit.wikimedia.org/r/1255153
[01:13:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.04% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:18:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 22.54% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:44:25] FIRING: [8x] BFDdown: BFD session down between cr3-ulsfo and 10.128.0.6 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[01:50:03] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[01:54:36] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1254320 (https://phabricator.wikimedia.org/T419041) (owner: Bking)
[02:08:41] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:33:41] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:45:03] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[03:00:39] FIRING: TransitBGPDown: Transit BGP session down between cr2-magru and EdgeUno (2800:1e0:1025::10e) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr2-magru:9804&var-bgp_group=Transit6&var-bgp_neighbor=EdgeUno - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[03:05:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-magru and EdgeUno (200.25.58.212) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[03:35:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr2-magru and EdgeUno (200.25.58.212) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[03:35:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:47:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:50:00] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:53:13] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[05:44:25] FIRING: [8x] BFDdown: BFD session down between cr3-ulsfo and 10.128.0.6 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260319T0600)
[06:00:05] marostegui, Amir1, and federico3: Time to do the Primary database switchover deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260319T0600).
[07:00:05] Amir1, Urbanecm, and awight: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260319T0700).
[07:00:05] codenamenoreste: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:07:42] I am available to test 1251200 and deploy today, if needed
[07:14:38] !log aokoth@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply
[07:14:56] !log aokoth@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply
[07:15:08] Amir1 and urbanecm
[07:16:13] !log aokoth@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply
[07:16:34] !log aokoth@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply
[07:17:11] !log aokoth@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply
[07:17:35] !log aokoth@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply
[07:21:44] (PS1) AOkoth: miscweb: fix helmfile add wmf-navigator to releases [deployment-charts] - https://gerrit.wikimedia.org/r/1255514 (https://phabricator.wikimedia.org/T414405)
[07:23:41] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:26:27] ops-magru: Inbound errors on interface cr1-magru:xe-0/1/1 (Transport: cr2-eqiad:xe-1/0/1:3 (Telxius, CRT-008508) {#70089}) - https://phabricator.wikimedia.org/T413409#11726514 (ayounsi) Open→Resolved Indeed! The errors were happening with the same levels of traffic as we have now, so looks like i...
[07:35:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:36:03] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1048.eqiad.wmnet
[07:36:22] (PS4) Arnaudb: gerrit: adjust mpm_event configuration to allow connection reuse on CDN [puppet] - https://gerrit.wikimedia.org/r/1254940 (https://phabricator.wikimedia.org/T420189)
[07:36:22] (CR) Arnaudb: "The initial idea behind that change was to test our working theory on `MaxRequestWorkers`. I've updated the change to fit what's documente" [puppet] - https://gerrit.wikimedia.org/r/1254940 (https://phabricator.wikimedia.org/T420189) (owner: Arnaudb)
[07:39:09] jmm@cumin2002 drain-node (PID 3980845) is awaiting input
[07:47:26] (PS1) AOkoth: miscweb: add wmf-navigator aux ingress record [dns] - https://gerrit.wikimedia.org/r/1255523 (https://phabricator.wikimedia.org/T414405)
[07:48:29] well, no deployment today -_-
[07:52:15] (PS2) AOkoth: miscweb: add wmf-navigator aux ingress record [dns] - https://gerrit.wikimedia.org/r/1255523 (https://phabricator.wikimedia.org/T414405)
[07:58:22] I'm still available
[07:58:58] I can maybe backport after the train deployment to be done very soon
[07:59:29] SRE, SRE-Access-Requests: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11726540 (Martyn.ranyard) As the EM of Annie's cross-functional team at WMDE, I approve this request. @katiamusiolekwmde has not yet got their phabricator...
[08:00:05] andre and brennen: Your horoscope predicts another MediaWiki train - Utc-0+Utc-7 Version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260319T0800).
[08:00:10] o/
[08:00:20] (PS1) Muehlenhoff: Add new doh/hcaptcha-proxy VMs to site.pp [puppet] - https://gerrit.wikimedia.org/r/1255564
[08:00:52] (CR) CI reject: [V:-1] Add new doh/hcaptcha-proxy VMs to site.pp [puppet] - https://gerrit.wikimedia.org/r/1255564 (owner: Muehlenhoff)
[08:02:11] (PS2) Muehlenhoff: Add new doh/hcaptcha-proxy VMs to site.pp [puppet] - https://gerrit.wikimedia.org/r/1255564 (https://phabricator.wikimedia.org/T418993)
[08:03:19] (PS1) TrainBranchBot: group2 to 1.46.0-wmf.20 [mediawiki-config] - https://gerrit.wikimedia.org/r/1255576 (https://phabricator.wikimedia.org/T413811)
[08:03:24] (CR) TrainBranchBot: [C:+2] "Initiated by aklapper@deploy2002" [mediawiki-config] - https://gerrit.wikimedia.org/r/1255576 (https://phabricator.wikimedia.org/T413811) (owner: TrainBranchBot)
[08:04:33] (Merged) jenkins-bot: group2 to 1.46.0-wmf.20 [mediawiki-config] - https://gerrit.wikimedia.org/r/1255576 (https://phabricator.wikimedia.org/T413811) (owner: TrainBranchBot)
[08:05:39] jmm@cumin2002 drain-node (PID 3980845) is awaiting input
[08:07:59] ml-etcd1002,dse-k8s-etcd1003 will go down for a Ganeti reboot
[08:08:05] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1048.eqiad.wmnet
[08:08:19] (PS1) Slyngshede: C:external_clouds_vendors remove GeekyWorld [puppet] - https://gerrit.wikimedia.org/r/1255580
[08:09:50] PROBLEM - Host dse-k8s-etcd1003 is DOWN: PING CRITICAL - Packet loss = 100%
[08:09:50] PROBLEM - Host ml-etcd1002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:10:43] (CR) Elukey: [C:+1] proton: Bump image [deployment-charts] - https://gerrit.wikimedia.org/r/1254867 (owner: Muehlenhoff)
[08:10:58] RECOVERY - Host ml-etcd1002 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms
[08:10:58] RECOVERY - Host dse-k8s-etcd1003 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms
[08:11:50] (PS1) Clément Goubert: rest-gateway: Add linkrecommendation support [deployment-charts] - https://gerrit.wikimedia.org/r/1255581 (https://phabricator.wikimedia.org/T418148)
[08:12:12] !log aklapper@deploy2002 rebuilt and synchronized wikiversions files: group2 to 1.46.0-wmf.20 refs T413811
[08:12:16] T413811: 1.46.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T413811
[08:13:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1048.eqiad.wmnet
[08:13:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1048.eqiad.wmnet
[08:13:55] SRE, SRE-Access-Requests: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11726559 (Martyn.ranyard) @KFrancis could you organize the NDA signature for this request ? Thanks
[08:14:33] !log installing imagemagick security updates on Bullseye
[08:14:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:05] (PS1) Muehlenhoff: Apply installserver role to install4004 [puppet] - https://gerrit.wikimedia.org/r/1255582 (https://phabricator.wikimedia.org/T418993)
[08:19:24] (PS1) Clément Goubert: rest-gateway: Add api.w.o device-analytics support [deployment-charts] - https://gerrit.wikimedia.org/r/1255590 (https://phabricator.wikimedia.org/T418147)
[08:21:44] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-airflow-sre: apply
[08:21:50] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-airflow-sre: apply
[08:25:29] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-sre: apply
[08:25:49] (CR) Ayounsi: [C:+1] Add new doh/hcaptcha-proxy VMs to site.pp [puppet] - https://gerrit.wikimedia.org/r/1255564 (https://phabricator.wikimedia.org/T418993) (owner: Muehlenhoff)
[08:26:30] (CR) Muehlenhoff: [C:+2] Add new doh/hcaptcha-proxy VMs to site.pp [puppet] - https://gerrit.wikimedia.org/r/1255564 (https://phabricator.wikimedia.org/T418993) (owner: Muehlenhoff)
[08:26:40] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-sre: apply
[08:27:00] (CR) Ayounsi: [C:+1] Apply installserver role to install4004 [puppet] - https://gerrit.wikimedia.org/r/1255582 (https://phabricator.wikimedia.org/T418993) (owner: Muehlenhoff)
[08:29:56] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host doh4003.wikimedia.org
[08:29:58] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[08:31:43] !log installing python-apt security updates
[08:31:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:42] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh4003.wikimedia.org - jmm@cumin2002"
[08:34:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh4003.wikimedia.org - jmm@cumin2002"
[08:34:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:34:49] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache doh4003.wikimedia.org on all recursors
[08:34:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh4003.wikimedia.org on all recursors
[08:35:08] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[08:37:08] SRE, SRE-Access-Requests: Requesting access to production for Mpostoronca-wmf - https://phabricator.wikimedia.org/T420458#11726601 (MPostoronca-WMF)
[08:37:42] SRE, SRE-Access-Requests: Requesting access to production for Mpostoronca-wmf - https://phabricator.wikimedia.org/T420458#11726603 (MPostoronca-WMF) >>! In T420458#11723037, @ayounsi wrote: > @OKryva-WMF do you approve this request ? > @thcipriani do you approve this request ? > @MPostoronca-WMF could yo...
[08:38:42] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM doh4003.wikimedia.org - jmm@cumin2002"
[08:38:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM doh4003.wikimedia.org - jmm@cumin2002"
[08:38:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:38:48] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache doh4003.wikimedia.org on all recursors
[08:38:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh4003.wikimedia.org on all recursors
[08:38:56] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host doh4003.wikimedia.org
[08:39:39] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet
[08:40:05] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti4007.ulsfo.wmnet
[08:41:34] SRE, SRE-swift-storage, Observability-Metrics: thanos swift capacity for FY 26/27 - https://phabricator.wikimedia.org/T419713#11726622 (MatthewVernon) @hnowlan can I push this up your stack, please? Willy wants all procurement requests for next FY done by end of next week (i.e. 27 March).
[08:42:04] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of install4003.wikimedia.org to plain
[08:42:45] SRE, collaboration-services, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11726624 (ops-monitoring-bot) VM install4003.wikimedia.org switching disk type to plain
[08:43:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install4003.wikimedia.org to plain
[08:43:21] (CR) Arnaudb: [C:+1] miscweb: add wmf-navigator aux ingress record [dns] - https://gerrit.wikimedia.org/r/1255523 (https://phabricator.wikimedia.org/T414405) (owner: AOkoth)
[08:44:15] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of hcaptcha-proxy4002.wikimedia.org to plain
[08:44:46] SRE, collaboration-services, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11726626 (ops-monitoring-bot) VM hcaptcha-proxy4002.wikimedia.org switching disk type to plain
[08:45:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of hcaptcha-proxy4002.wikimedia.org to plain
[08:45:10] (CR) Arnaudb: [C:+1] miscweb: fix helmfile add wmf-navigator to releases [deployment-charts] - https://gerrit.wikimedia.org/r/1255514 (https://phabricator.wikimedia.org/T414405) (owner: AOkoth)
[08:45:40] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of hcaptcha-proxy4001.wikimedia.org to plain
[08:46:15] SRE, collaboration-services, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11726629 (ops-monitoring-bot) VM hcaptcha-proxy4001.wikimedia.org switching disk type to plain
[08:46:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of hcaptcha-proxy4001.wikimedia.org to plain
[08:46:49] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of doh4002.wikimedia.org to plain
[08:47:38] PROBLEM - Bird Internet Routing Daemon on hcaptcha-proxy4002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[08:47:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:48:22] SRE, collaboration-services, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11726632 (ops-monitoring-bot) VM doh4002.wikimedia.org switching disk type to plain
[08:48:38] RECOVERY - Bird Internet Routing Daemon on hcaptcha-proxy4002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[08:48:40] PROBLEM - Bird Internet Routing Daemon on hcaptcha-proxy4001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[08:48:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh4002.wikimedia.org to plain
[08:49:25] FIRING: [10x] BFDdown: BFD session down between cr3-ulsfo and 10.128.0.6 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[08:49:35] (PS3) Daniel Kinzler: rest-gateway: update readme [deployment-charts] - https://gerrit.wikimedia.org/r/1254848
[08:49:38] RECOVERY - Bird Internet Routing Daemon on hcaptcha-proxy4001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[08:50:28] PROBLEM - Bird Internet Routing Daemon on doh4002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[08:52:28] RECOVERY - Bird Internet Routing Daemon on doh4002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[08:54:25] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of doh4001.wikimedia.org to plain
[08:54:25] FIRING: [16x] BFDdown: BFD session down between cr3-ulsfo and 10.128.0.6 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[08:54:40] FIRING: [16x] BFDdown: BFD session down between cr3-ulsfo and 10.128.0.6 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[08:56:00] (PS1) Muehlenhoff: Remove ganeti4007 from classic Ganeti cluster in ulsfo [puppet] - https://gerrit.wikimedia.org/r/1255603 (https://phabricator.wikimedia.org/T418993)
[08:56:28] SRE, collaboration-services, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11726637 (ops-monitoring-bot) VM doh4001.wikimedia.org switching disk type to plain
[08:56:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh4001.wikimedia.org to plain
[08:58:00] (CR) Ayounsi: [C:+1] Remove ganeti4007 from classic Ganeti cluster in ulsfo [puppet] - https://gerrit.wikimedia.org/r/1255603 (https://phabricator.wikimedia.org/T418993) (owner: Muehlenhoff)
[08:58:22] SRE, SRE-Access-Requests: Requesting access to WMDE LDAP group for Sarmbruster - https://phabricator.wikimedia.org/T420410#11726639 (Sarmbruster) Just signed the NDA via docusign.
[08:58:26] PROBLEM - Bird Internet Routing Daemon on doh4001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[09:00:26] RECOVERY - Bird Internet Routing Daemon on doh4001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[09:01:50] !log remove ganeti4007 from classic Ganeti cluster in ulsfo T418993
[09:01:52] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[09:01:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:55] T418993: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993
[09:04:25] FIRING: [20x] BFDdown: BFD session down between cr3-ulsfo and 10.128.0.6 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[09:04:26] PROBLEM - ganeti-confd running on ganeti4007 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 109 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti
[09:04:26] PROBLEM - ganeti-noded running on ganeti4007 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti
[09:05:50] FIRING: ProbeDown: Service ganeti4007:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:06:54] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 222.32 ms
[09:08:39] (PS1) Effie Mouzeli: hieradata: migrate codfw memcached cluster to nftables [puppet] - https://gerrit.wikimedia.org/r/1255610 (https://phabricator.wikimedia.org/T398611)
[09:09:27] (CR) Muehlenhoff: [C:+2] Remove ganeti4007 from classic Ganeti cluster in ulsfo [puppet] - https://gerrit.wikimedia.org/r/1255603 (https://phabricator.wikimedia.org/T418993) (owner: Muehlenhoff)
[09:10:41] (PS1) Brouberol: Revert^2 "kafka-mirrormaker: migrate logging-{eqiad,codfw}->jumbo-eqiad to aux-eqiad" [deployment-charts] - https://gerrit.wikimedia.org/r/1255612
[09:11:59] !log klausman@cumin2002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-serve-worker-codfw
[09:13:13] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti4007.ulsfo.wmnet with OS bookworm
[09:13:29] SRE, collaboration-services, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11726657 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti4007.ulsfo.wmnet with...
[09:14:18] (CR) Elukey: [C:+1] Revert^2 "kafka-mirrormaker: migrate logging-{eqiad,codfw}->jumbo-eqiad to aux-eqiad" [deployment-charts] - https://gerrit.wikimedia.org/r/1255612 (owner: Brouberol)
[09:14:25] RESOLVED: [12x] BFDdown: BFD session down between cr3-ulsfo and 10.128.0.6 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[09:15:47] (CR) Brouberol: [C:+2] Revert^2 "kafka-mirrormaker: migrate logging-{eqiad,codfw}->jumbo-eqiad to aux-eqiad" [deployment-charts] - https://gerrit.wikimedia.org/r/1255612 (owner: Brouberol)
[09:15:49] (CR) Muehlenhoff: [C:+1] "That is really great, many thanks for pushing this forward!" [puppet] - https://gerrit.wikimedia.org/r/1255610 (https://phabricator.wikimedia.org/T398611) (owner: Effie Mouzeli)
[09:19:17] !log brouberol@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/kafka-mirrormaker: apply
[09:19:40] !log brouberol@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/kafka-mirrormaker: apply
[09:21:20] !log brouberol@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/kafka-mirrormaker: apply
[09:21:43] !log brouberol@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/kafka-mirrormaker: apply
[09:24:02] !log klausman@cumin2002 END (ERROR) - Cookbook sre.k8s.reboot-nodes (exit_code=97) rolling reboot on A:ml-serve-worker-codfw
[09:26:12] !log klausman@cumin2002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-serve-worker-codfw
[09:26:29] (PS1) Brouberol: aux-k8s/kafka-mirrormaker: add missing releases [deployment-charts] - https://gerrit.wikimedia.org/r/1255620 (https://phabricator.wikimedia.org/T417407)
[09:26:54] !log klausman@cumin2002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-serve-worker-codfw
[09:28:42] SRE, collaboration-services, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11726685 (ayounsi)
[09:29:03] !log btullis@cumin1003 START - Cookbook sre.ceph.roll-restart-reboot-server rolling reboot on A:cephosd-eqiad
[09:31:43] PROBLEM - BFD status on lsw1-e1-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:33:15] (CR) Jaime Nuche: "Make sense, thanks for the improvement @dzahn@wikimedia.org!" [puppet] - https://gerrit.wikimedia.org/r/1254331 (https://phabricator.wikimedia.org/T420246) (owner: Dzahn)
[09:33:29] SRE, Infrastructure-Foundations: Integrate Bookworm 12.13 point update - https://phabricator.wikimedia.org/T414205#11726713 (MoritzMuehlenhoff)
[09:35:13] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti4007.ulsfo.wmnet with reason: host reimage
[09:35:37] !log installing libnginx-mod-http-lua security updates
[09:35:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:41] RECOVERY - BFD status on lsw1-e1-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:39:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti4007.ulsfo.wmnet with reason: host reimage
[09:42:15] !log jnuche@deploy2002 Started deploy [releng/jenkins-deploy@863e5c2] (releasing): T420477
[09:43:02] !log jnuche@deploy2002 Finished deploy [releng/jenkins-deploy@863e5c2] (releasing): T420477 (duration: 00m 59s)
[09:43:24] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1017.eqiad.wmnet with OS bookworm
[09:45:34] !log jnuche@deploy2002 Started deploy [releng/jenkins-deploy@863e5c2] (releasing): T420477
[09:46:41] !log jnuche@deploy2002 Finished deploy [releng/jenkins-deploy@863e5c2] (releasing): T420477 (duration: 01m 07s)
[09:46:44] PROBLEM - BFD status on lsw1-e2-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:49:39] (CR) Elukey: [C:+1] aux-k8s/kafka-mirrormaker: add missing releases [deployment-charts] - https://gerrit.wikimedia.org/r/1255620 (https://phabricator.wikimedia.org/T417407) (owner: Brouberol)
[09:53:05] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host schema1003.eqiad.wmnet
[09:53:13] (PS3) Fabfur: haproxy: test haproxy32 on cp2041 [puppet] - https://gerrit.wikimedia.org/r/1254195 (https://phabricator.wikimedia.org/T419825)
[09:53:24] (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1254195 (https://phabricator.wikimedia.org/T419825) (owner: Fabfur)
[09:53:44] RECOVERY - BFD status on lsw1-e2-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:56:20] (CR) Gmodena: [C:+1] wikidata-platform: wdqs-queryhammer helmfile deployment (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1251453 (https://phabricator.wikimedia.org/T417415) (owner: Trueg)
[09:56:34] (CR) Brouberol: [C:+2] aux-k8s/kafka-mirrormaker: add missing releases [deployment-charts] - https://gerrit.wikimedia.org/r/1255620 (https://phabricator.wikimedia.org/T417407) (owner: Brouberol)
[09:57:00] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host schema1003.eqiad.wmnet
[09:57:10] (CR) Daniel Kinzler: [C:+2] rest gateway: merge authed-other into authed-bot [deployment-charts] - https://gerrit.wikimedia.org/r/1254921 (https://phabricator.wikimedia.org/T420467) (owner: Daniel Kinzler)
[09:58:17] (PS1) Muehlenhoff: Make ganeti4007 a Ganeti node on routed Ganeti/ulsfo [puppet] - https://gerrit.wikimedia.org/r/1255644 (https://phabricator.wikimedia.org/T418993)
[09:58:28] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 17 hosts with reason: upgrade
[09:58:37] (CR) Effie Mouzeli: [C:+2] hieradata: migrate codfw memcached cluster to nftables [puppet] - https://gerrit.wikimedia.org/r/1255610 (https://phabricator.wikimedia.org/T398611) (owner: Effie Mouzeli)
[09:59:22] (Merged) jenkins-bot: rest gateway: merge authed-other into authed-bot [deployment-charts] - https://gerrit.wikimedia.org/r/1254921 (https://phabricator.wikimedia.org/T420467) (owner: Daniel Kinzler)
[10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260319T1000)
[10:00:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti4007.ulsfo.wmnet with OS bookworm
[10:00:41] SRE, collaboration-services, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11726772 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti4007.ulsfo.wmnet with OS b...
[10:02:44] PROBLEM - BFD status on lsw1-e3-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:03:59] !log daniel@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[10:04:41] !log daniel@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[10:08:44] RECOVERY - BFD status on lsw1-e3-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:09:13] !log btullis@cumin1003 START - Cookbook sre.opensearch.roll-restart-reboot rolling reboot on A:datahubsearch
[10:09:39] (CR) Ayounsi: [C:+1] Make ganeti4007 a Ganeti node on routed Ganeti/ulsfo [puppet] - https://gerrit.wikimedia.org/r/1255644 (https://phabricator.wikimedia.org/T418993) (owner: Muehlenhoff)
[10:10:06] ops-codfw, SRE, Data-Persistence-Backup, DC-Ops, media-backups: backup2005 power supplies fried or overvoltage - https://phabricator.wikimedia.org/T419970#11726795 (jcrespo)
[10:10:07] ops-codfw, SRE, DC-Ops: Unresponsive management for backup2005.mgmt:22 - https://phabricator.wikimedia.org/T420308#11726798 (jcrespo) →Duplicate dup:T419970
[10:10:14] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host testvm2007.codfw.wmnet
[10:10:51] (CR) Muehlenhoff: [C:+2] Make ganeti4007 a Ganeti node on routed Ganeti/ulsfo [puppet] -
10https://gerrit.wikimedia.org/r/1255644 (https://phabricator.wikimedia.org/T418993) (owner: 10Muehlenhoff) [10:13:30] !log fnegri@cumin1003 START - Cookbook sre.hosts.reboot-single for host clouddumps1001.wikimedia.org [10:14:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testvm2007.codfw.wmnet [10:16:40] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host testvm2008.wikimedia.org [10:18:24] !log daniel@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [10:19:41] !log daniel@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [10:20:16] (03CR) 10Btullis: wikidata-platform: wdqs-queryhammer chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251095 (https://phabricator.wikimedia.org/T417415) (owner: 10Trueg) [10:20:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testvm2008.wikimedia.org [10:21:46] !log fnegri@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host clouddumps1001.wikimedia.org [10:21:54] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host testvm2006.codfw.wmnet [10:22:15] (03CR) 10Btullis: [C:03+2] wikidata-platform: wdqs-queryhammer chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251095 (https://phabricator.wikimedia.org/T417415) (owner: 10Trueg) [10:22:25] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet [10:23:38] (03CR) 10AOkoth: [C:03+2] miscweb: fix helmfile add wmf-navigator to releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255514 (https://phabricator.wikimedia.org/T414405) (owner: 10AOkoth) [10:23:55] (03Merged) 10jenkins-bot: wikidata-platform: wdqs-queryhammer chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251095 (https://phabricator.wikimedia.org/T417415) (owner: 10Trueg) [10:24:43] !log btullis@cumin1003 END (FAIL) - Cookbook 
sre.ceph.roll-restart-reboot-server (exit_code=99) rolling reboot on A:cephosd-eqiad [10:25:00] !log btullis@cumin1003 END (PASS) - Cookbook sre.opensearch.roll-restart-reboot (exit_code=0) rolling reboot on A:datahubsearch [10:25:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testvm2006.codfw.wmnet [10:26:56] !log daniel@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [10:28:40] !log fnegri@cumin1003 START - Cookbook sre.hosts.reboot-single for host clouddumps1002.wikimedia.org [10:28:49] !log daniel@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [10:29:08] !log btullis@cumin1003 START - Cookbook sre.ceph.roll-restart-reboot-server rolling reboot on P{cephosd100[4-5]*} and (A:cephosd-codfw or A:cephosd-eqiad) [10:30:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet [10:31:47] !log aokoth@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply [10:32:26] !log aokoth@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply [10:32:46] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host schema1004.eqiad.wmnet [10:33:41] !log aokoth@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply [10:33:45] PROBLEM - BFD status on lsw1-f1-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:34:09] !log aokoth@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply [10:36:14] (03PS1) 10Federico Ceratto: wmnet: update CNAME records for DB masters to eqiad [dns] - 10https://gerrit.wikimedia.org/r/1255655 (https://phabricator.wikimedia.org/T416705) [10:36:20] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1017.eqiad.wmnet with OS bookworm [10:36:44] !log created temporary categorylinks_icu72 tables -- T419980, T419049 
[10:36:45] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host schema1004.eqiad.wmnet [10:36:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:49] T419980: ICU 72 upgrade: `categorylinks` table swap - https://phabricator.wikimedia.org/T419980 [10:36:50] T419049: Upgrade the MediaWiki servers to ICU 72 ☂️ - https://phabricator.wikimedia.org/T419049 [10:37:04] !log fnegri@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host clouddumps1002.wikimedia.org [10:37:07] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host schema2003.codfw.wmnet [10:37:11] 06SRE-OnFire, 10Cite, 10VisualEditor, 10WMDE-TechWish-Maintenance, and 3 others: Investigation: Write visual editor debug tool to produce Converter test cases - https://phabricator.wikimedia.org/T400311#11726876 (10WMDE-Fisch) @awight maybe we close this ticket and abandon leftover patches for now? 🤔 [10:39:35] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host schema2003.codfw.wmnet [10:40:45] RECOVERY - BFD status on lsw1-f1-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:41:29] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host schema2004.codfw.wmnet [10:42:53] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti4007.ulsfo.wmnet to cluster ulsfo02 and group 01 [10:43:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti4007.ulsfo.wmnet to cluster ulsfo02 and group 01 [10:43:46] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host doh4003.wikimedia.org [10:43:48] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [10:44:24] 06SRE, 06collaboration-services, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating ulsfo to routed Ganeti - 
https://phabricator.wikimedia.org/T418993#11726885 (10MoritzMuehlenhoff) [10:45:11] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host testvm2005.codfw.wmnet [10:45:30] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host schema2004.codfw.wmnet [10:46:02] (03PS1) 10Brouberol: kafka-main-codfw: disable mirroring to kafka-main-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1255656 (https://phabricator.wikimedia.org/T417407) [10:46:04] (03PS1) 10Brouberol: kafka-main-eqiad: disable mirroring to kafka-main-codfw [puppet] - 10https://gerrit.wikimedia.org/r/1255657 (https://phabricator.wikimedia.org/T417407) [10:46:07] (03PS1) 10Brouberol: kafka-jumbo-eqiad: disable mirroring from kafka-main-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1255658 (https://phabricator.wikimedia.org/T417407) [10:47:54] (03PS2) 10Brouberol: kafka-main-eqiad: disable mirroring to kafka-main-codfw [puppet] - 10https://gerrit.wikimedia.org/r/1255657 (https://phabricator.wikimedia.org/T417407) [10:47:54] (03PS2) 10Brouberol: kafka-main-codfw: disable mirroring to kafka-main-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1255656 (https://phabricator.wikimedia.org/T417407) [10:47:54] (03PS2) 10Brouberol: kafka-jumbo-eqiad: disable mirroring from kafka-main-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1255658 (https://phabricator.wikimedia.org/T417407) [10:48:00] (03PS1) 10Brouberol: aux-k8s/kafka-mirrormaker: add main-eqiad-to-main-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255659 (https://phabricator.wikimedia.org/T417407) [10:48:03] (03PS1) 10Brouberol: aux-k8s/kafka-mirrormaker: add main-codfw-to-main-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255660 (https://phabricator.wikimedia.org/T417407) [10:48:05] (03PS1) 10Brouberol: aux-k8s/kafka-mirrormaker: add main-eqad-to-jumbo-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255661 (https://phabricator.wikimedia.org/T417407) 
[10:48:10] (03PS1) 10Brouberol: aux-k8s/kafka-mirrormaker: cleanup helmfile of duplicated namespace definitions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255662 (https://phabricator.wikimedia.org/T417407) [10:48:45] PROBLEM - BFD status on lsw1-f2-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:48:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testvm2005.codfw.wmnet [10:49:54] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host testvm2004.codfw.wmnet [10:50:08] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh4003.wikimedia.org - jmm@cumin2002" [10:50:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh4003.wikimedia.org - jmm@cumin2002" [10:50:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:50:14] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache doh4003.wikimedia.org on all recursors [10:50:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh4003.wikimedia.org on all recursors [10:50:33] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [10:51:08] !log jiji@cumin1003 START - Cookbook sre.memcached.roll-reboot-restart rolling reboot on A:memcached-codfw [10:53:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testvm2004.codfw.wmnet [10:54:29] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM doh4003.wikimedia.org - jmm@cumin2002" [10:54:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove 
records for VM doh4003.wikimedia.org - jmm@cumin2002" [10:54:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:54:38] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache doh4003.wikimedia.org on all recursors [10:54:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh4003.wikimedia.org on all recursors [10:54:46] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host doh4003.wikimedia.org [10:55:43] !log btullis@cumin1003 END (PASS) - Cookbook sre.ceph.roll-restart-reboot-server (exit_code=0) rolling reboot on P{cephosd100[4-5]*} and (A:cephosd-codfw or A:cephosd-eqiad) [10:55:45] RECOVERY - BFD status on lsw1-f2-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:58:21] 10SRE-tools, 06Infrastructure-Foundations, 06serviceops-radar: Add --min-uptime to cookbooks - https://phabricator.wikimedia.org/T419967#11726917 (10fgiunchedi) FWIW I found some prior art / ideas here {T367592} [10:59:58] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet [11:03:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet [11:05:07] !log klausman@cumin2002 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:ml-serve-worker-codfw [11:07:39] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet [11:08:41] FIRING: JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:09:51] (03CR) 10Ladsgroup: "You can use https://switchmaster.toolforge.org/dc-switch to create this and it's much safer since it's automatic." 
[dns] - 10https://gerrit.wikimedia.org/r/1255655 (https://phabricator.wikimedia.org/T416705) (owner: 10Federico Ceratto) [11:10:28] 10SRE-tools, 06Infrastructure-Foundations, 06serviceops-radar: Add --min-uptime to cookbooks - https://phabricator.wikimedia.org/T419967#11726958 (10jijiki) [11:11:05] (03CR) 10Elukey: [C:03+1] aux-k8s/kafka-mirrormaker: add main-eqiad-to-main-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255659 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol) [11:11:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet [11:11:42] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1017.eqiad.wmnet with OS bookworm [11:12:12] (03CR) 10Elukey: [C:03+1] aux-k8s/kafka-mirrormaker: add main-codfw-to-main-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255660 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol) [11:12:36] (03CR) 10Elukey: [C:03+1] aux-k8s/kafka-mirrormaker: add main-eqad-to-jumbo-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255661 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol) [11:13:28] (03CR) 10Elukey: [C:04-1] "Precautionary -1" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255660 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol) [11:13:41] RESOLVED: JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:14:14] (03CR) 10Elukey: [C:03+1] aux-k8s/kafka-mirrormaker: cleanup helmfile of duplicated namespace definitions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255662 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol) [11:15:01] FIRING: [2x] JobUnavailable: Reduced availability for job routinator in 
ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:17:36] (03PS1) 10JavierMonton: stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255668 (https://phabricator.wikimedia.org/T420448) [11:18:01] !log btullis@cumin1003 START - Cookbook sre.ceph.roll-restart-reboot-server rolling reboot on A:cephosd-codfw [11:18:09] !log jiji@cumin1003 END (PASS) - Cookbook sre.memcached.roll-reboot-restart (exit_code=0) rolling reboot on A:memcached-codfw [11:19:28] (03PS1) 10Gerrit maintenance bot: wmnet: update CNAME records for DB masters for dc switchover [dns] - 10https://gerrit.wikimedia.org/r/1255669 (https://phabricator.wikimedia.org/T416705) [11:20:48] PROBLEM - BFD status on lsw1-a7-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:21:55] (03CR) 10Muehlenhoff: "(We'll keep that up when Jelto is back)" [puppet] - 10https://gerrit.wikimedia.org/r/1251406 (owner: 10Jelto) [11:25:01] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [11:25:53] 06SRE, 10SRE-Access-Requests: Requesting access to +2 on operations/deployment-charts for trueg and lerickson - https://phabricator.wikimedia.org/T420568 (10trueg) 03NEW [11:26:29] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1017.eqiad.wmnet with reason: host reimage [11:26:48] (03CR) 10TChin: [C:03+1] stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255668 (https://phabricator.wikimedia.org/T420448) (owner: 10JavierMonton) [11:27:30] 
(03PS2) 10Ladsgroup: wmnet: update CNAME records for DB masters for dc switchover [dns] - 10https://gerrit.wikimedia.org/r/1255669 (https://phabricator.wikimedia.org/T416705) (owner: 10Gerrit maintenance bot) [11:28:41] RESOLVED: [2x] JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:28:41] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [11:28:48] RECOVERY - BFD status on lsw1-a7-codfw.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:30:18] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1017.eqiad.wmnet with reason: host reimage [11:32:25] (03CR) 10Slyngshede: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1254195 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur) [11:33:07] (03CR) 10JavierMonton: [C:03+2] stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255668 (https://phabricator.wikimedia.org/T420448) (owner: 10JavierMonton) [11:34:37] 06SRE, 10MinT, 10Prod-Kubernetes, 06ServiceOps new, and 3 others: Can't deploy machinetranslation due to exceeding resource quotas - https://phabricator.wikimedia.org/T411058#11727019 (10Nikerabbit) 05In progress→03Resolved [11:35:14] (03Merged) 10jenkins-bot: stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255668 (https://phabricator.wikimedia.org/T420448) (owner: 10JavierMonton) [11:35:40] FIRING: 
SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:36:48] PROBLEM - BFD status on lsw1-c2-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:40:13] (03PS4) 10Trueg: wikidata-platform: wdqs-queryhammer helmfile deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251453 (https://phabricator.wikimedia.org/T417415) [11:40:35] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [11:40:44] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [11:41:26] (03CR) 10JMeybohm: [C:03+1] cassandra-http-gateway: new chart based on aqs-http-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250649 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans) [11:41:48] (03CR) 10JMeybohm: [C:03+1] charts/cassandra-http-gateway: template table configuration for hoarde [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250650 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans) [11:42:48] RECOVERY - BFD status on lsw1-c2-codfw.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:43:54] 06SRE, 06Infrastructure-Foundations: Upgrade Routinator to 0.15.1 - https://phabricator.wikimedia.org/T420572 (10MoritzMuehlenhoff) 03NEW [11:44:35] 06SRE, 06Infrastructure-Foundations: Upgrade Routinator to 0.15.1 - https://phabricator.wikimedia.org/T420572#11727103 (10MoritzMuehlenhoff) p:05Triage→03Medium [11:44:48] (03PS1) 10Michael Große: createAccount: Log exposure and CTRs for account creation experiment [extensions/WikimediaEvents] (wmf/1.46.0-wmf.20) - 
10https://gerrit.wikimedia.org/r/1255685 (https://phabricator.wikimedia.org/T419916) [11:45:02] (03PS1) 10Michael Große: CreateAccount: Add class to aide in instrumentation [extensions/GrowthExperiments] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1255686 [11:45:08] (03CR) 10CI reject: [V:04-1] createAccount: Log exposure and CTRs for account creation experiment [extensions/WikimediaEvents] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1255685 (https://phabricator.wikimedia.org/T419916) (owner: 10Michael Große) [11:46:53] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [11:46:58] (03PS5) 10Jcrespo: mediabackups: Open s3 storage port on storage hosts from working hosts [puppet] - 10https://gerrit.wikimedia.org/r/1255003 (https://phabricator.wikimedia.org/T410028) [11:47:10] (03PS2) 10Brouberol: aux-k8s/kafka-mirrormaker: add main-codfw-to-main-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255660 (https://phabricator.wikimedia.org/T417407) [11:47:10] (03PS2) 10Brouberol: aux-k8s/kafka-mirrormaker: add main-eqad-to-jumbo-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255661 (https://phabricator.wikimedia.org/T417407) [11:47:10] (03PS2) 10Brouberol: aux-k8s/kafka-mirrormaker: cleanup helmfile of duplicated namespace definitions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255662 (https://phabricator.wikimedia.org/T417407) [11:47:20] (03CR) 10Brouberol: aux-k8s/kafka-mirrormaker: add main-codfw-to-main-eqiad (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255660 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol) [11:47:27] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1255003 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo) [11:47:45] (03CR) 10CI reject: [V:04-1] mediabackups: Open s3 storage port on 
storage hosts from working hosts [puppet] - 10https://gerrit.wikimedia.org/r/1255003 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo) [11:48:13] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1255657 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol) [11:48:15] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1255656 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol) [11:48:17] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1255658 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol) [11:48:18] (03CR) 10Trueg: [C:03+2] wikidata-platform: wdqs-queryhammer helmfile deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251453 (https://phabricator.wikimedia.org/T417415) (owner: 10Trueg) [11:48:28] (03PS6) 10Jcrespo: mediabackups: Open s3 storage port on storage hosts from working hosts [puppet] - 10https://gerrit.wikimedia.org/r/1255003 (https://phabricator.wikimedia.org/T410028) [11:48:32] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1255003 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo) [11:49:57] btullis@cumin1003 reimage (PID 342152) is awaiting input [11:50:08] (03Merged) 10jenkins-bot: wikidata-platform: wdqs-queryhammer helmfile deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251453 (https://phabricator.wikimedia.org/T417415) (owner: 10Trueg) [11:51:48] PROBLEM - BFD status on lsw1-d2-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:51:48] (03PS1) 10Dreamy Jazz: mw::maintenance: Purge blocks on closed but not preinstall wikis [puppet] - 10https://gerrit.wikimedia.org/r/1255687 (https://phabricator.wikimedia.org/T420571) [11:53:06] jouncebot: next [11:53:06] In 0 hour(s) and 6 minute(s): Mobileapps/RESTBase/Wikifeeds 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260319T1200) [11:53:37] !log upgrade rpki2003 to Routinator 0.15.1 T420572 [11:53:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:42] T420572: Upgrade Routinator to 0.15.1 - https://phabricator.wikimedia.org/T420572 [11:54:56] (03PS2) 10Dreamy Jazz: mw::maintenance: Purge blocks on closed but not preinstall wikis [puppet] - 10https://gerrit.wikimedia.org/r/1255687 (https://phabricator.wikimedia.org/T420571) [11:55:03] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [11:55:24] (03CR) 10Dreamy Jazz: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1255687 (https://phabricator.wikimedia.org/T420571) (owner: 10Dreamy Jazz) [11:55:52] (03PS1) 10JMeybohm: wikikube: Add wikikube-worker[1335-1349].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1255689 (https://phabricator.wikimedia.org/T418259) [11:57:12] (03CR) 10JMeybohm: [C:03+2] Remove PSP related code from admin_ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248823 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm) [11:57:48] RECOVERY - BFD status on lsw1-d2-codfw.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:57:55] !log jiji@cumin1003 START - Cookbook sre.memcached.roll-reboot-restart rolling reboot on A:memcached-codfw [11:58:21] !log btullis@cumin1003 END (PASS) - Cookbook sre.ceph.roll-restart-reboot-server (exit_code=0) rolling reboot on A:cephosd-codfw [11:59:54] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - 
btullis@cumin1003" [11:59:54] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1017.eqiad.wmnet with OS bookworm [12:00:01] FIRING: JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260319T1200) [12:00:20] (03CR) 10Michael Große: "recheck" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1255685 (https://phabricator.wikimedia.org/T419916) (owner: 10Michael Große) [12:00:44] (03PS1) 10Muehlenhoff: versitygw: Don't set file ownership for root:root [puppet] - 10https://gerrit.wikimedia.org/r/1255690 [12:00:47] (03CR) 10Clément Goubert: [C:03+1] wikikube: Add wikikube-worker[1335-1349].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1255689 (https://phabricator.wikimedia.org/T418259) (owner: 10JMeybohm) [12:03:40] 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 10Infrastructure Security, and 2 others: Unexpected media growth led to low disk resources on several media backup hosts - https://phabricator.wikimedia.org/T410028#11727165 (10MoritzMuehlenhoff) >>! In T410028#11714176, @jcrespo wrote: > For this, the... 
[12:03:41] RESOLVED: JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:04:25] (03PS1) 10Btullis: Put dse-k8s-worker101[6-7] back into service [puppet] - 10https://gerrit.wikimedia.org/r/1255692 (https://phabricator.wikimedia.org/T414787) [12:04:57] (03Merged) 10jenkins-bot: Remove PSP related code from admin_ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248823 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm) [12:07:25] (03PS1) 10Dreamy Jazz: mw::maintenance: Run purgeRecentChanges.php on wikis without CheckUser [puppet] - 10https://gerrit.wikimedia.org/r/1255694 (https://phabricator.wikimedia.org/T420062) [12:07:43] (03CR) 10Jcrespo: [C:03+2] versitygw: Don't set file ownership for root:root [puppet] - 10https://gerrit.wikimedia.org/r/1255690 (owner: 10Muehlenhoff) [12:08:19] (03PS2) 10Dreamy Jazz: mw::maintenance: Run purgeRecentChanges.php on wikis without CheckUser [puppet] - 10https://gerrit.wikimedia.org/r/1255694 (https://phabricator.wikimedia.org/T420062) [12:09:24] (03CR) 10Btullis: [C:03+2] Put dse-k8s-worker101[6-7] back into service [puppet] - 10https://gerrit.wikimedia.org/r/1255692 (https://phabricator.wikimedia.org/T414787) (owner: 10Btullis) [12:10:03] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [12:10:06] !log urbanecm@deploy2002 mwscript-k8s job started: GrowthExperiments:reassignMentees --wiki=enwiki --mentor=Bilorv --performer=Bilorv --as-job # T418194 [12:10:10] T418194: Mentors still having mentees after removing themselves - 
https://phabricator.wikimedia.org/T418194 [12:10:30] (03CR) 10Dreamy Jazz: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1255694 (https://phabricator.wikimedia.org/T420062) (owner: 10Dreamy Jazz) [12:10:57] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 19 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1255686 (owner: 10Michael Große) [12:11:24] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 19 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1255685 (https://phabricator.wikimedia.org/T419916) (owner: 10Michael Große) [12:12:02] (03PS1) 10Muehlenhoff: Switch backup1015 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1255697 (https://phabricator.wikimedia.org/T410028) [12:15:03] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [12:17:20] (03CR) 10Kamila Součková: [C:03+1] rest-gateway: Add api.w.o device-analytics support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255590 (https://phabricator.wikimedia.org/T418147) (owner: 10Clément Goubert) [12:21:16] (03PS1) 10Jcrespo: mediabackup: Switch backup new media storage hosts to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1255704 (https://phabricator.wikimedia.org/T410028) [12:22:01] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1255704 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo) [12:22:15] !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 
'apply'. [12:22:55] !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [12:23:11] !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [12:23:18] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255708 [12:24:19] !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [12:25:02] !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [12:25:40] !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [12:27:46] !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [12:27:55] (03CR) 10Clément Goubert: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251475 (https://phabricator.wikimedia.org/T419548) (owner: 10Kamila Součková) [12:28:19] (03CR) 10Muehlenhoff: [C:03+1] "Looks good, one nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/1255003 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo) [12:28:22] (03CR) 10Jcrespo: [C:03+2] mediabackup: Switch backup new media storage hosts to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1255704 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo) [12:29:34] (03CR) 10JMeybohm: [C:03+2] Fix PodSecurityPolicy related comments [puppet] - 10https://gerrit.wikimedia.org/r/1250524 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm) [12:29:38] !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[12:29:41] (03PS2) 10Clément Goubert: rest-gateway: Add linkrecommendation support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255581 (https://phabricator.wikimedia.org/T418148) [12:31:05] (03PS7) 10Jcrespo: mediabackups: Open s3 storage port on storage hosts from working hosts [puppet] - 10https://gerrit.wikimedia.org/r/1255003 (https://phabricator.wikimedia.org/T410028) [12:31:13] (03CR) 10Kamila Součková: [C:03+2] shellbox: Setup shellbox-icu72 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251475 (https://phabricator.wikimedia.org/T419548) (owner: 10Kamila Součková) [12:31:14] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1255003 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo) [12:31:54] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1255003 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo) [12:33:10] jynus: good to merge "mediabackup: Switch backup new media storage hosts to nftables" ? [12:33:22] yes [12:33:29] sorry, too many things ongoing [12:33:34] np, done [12:33:41] I was about to [12:33:52] (03Merged) 10jenkins-bot: shellbox: Setup shellbox-icu72 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251475 (https://phabricator.wikimedia.org/T419548) (owner: 10Kamila Součková) [12:34:10] (03Abandoned) 10Muehlenhoff: Switch backup1015 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1255697 (https://phabricator.wikimedia.org/T410028) (owner: 10Muehlenhoff) [12:34:30] (03CR) 10Jcrespo: "Thank you so much Moritz for your help!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1255697 (https://phabricator.wikimedia.org/T410028) (owner: 10Muehlenhoff) [12:36:47] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host krb2002.codfw.wmnet [12:37:20] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host doh4003.wikimedia.org [12:37:21] !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/shellbox: apply [12:37:22] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [12:37:51] !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox: apply [12:38:31] !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox: apply [12:39:12] !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply [12:39:54] !log kamila@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox: apply [12:40:28] !log kamila@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox: apply [12:40:46] !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply [12:40:59] !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [12:41:12] !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply [12:41:13] (03PS13) 10Jcrespo: mediabackup: Prepare mediabackup worker profile for new storage backend [puppet] - 10https://gerrit.wikimedia.org/r/1254906 (https://phabricator.wikimedia.org/T410028) [12:41:23] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts testvm7001.magru.wmnet [12:41:39] !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply [12:41:56] (03PS14) 10Jcrespo: mediabackup: Prepare mediabackup worker profile for new storage backend [puppet] - 10https://gerrit.wikimedia.org/r/1254906 (https://phabricator.wikimedia.org/T410028) [12:42:15] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1254906 
(https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo) [12:42:28] 06SRE, 06Infrastructure-Foundations, 10netops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27), 07Essential-Work: Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11727260 (10brouberol) {F73147012} We can see that sockets are no longer leaking after the NIC replace... [12:42:30] (03PS1) 10Muehlenhoff: Remove testvm7001 [puppet] - 10https://gerrit.wikimedia.org/r/1255718 (https://phabricator.wikimedia.org/T396864) [12:42:39] 06SRE, 06Infrastructure-Foundations, 10netops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27), 07Essential-Work: Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11727262 (10brouberol) 05Open→03Resolved a:03brouberol [12:42:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host krb2002.codfw.wmnet [12:43:21] 06SRE, 06Infrastructure-Foundations, 10netops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27), 07Essential-Work: Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11727266 (10brouberol) a:05brouberol→03BTullis [12:43:28] !log kamila@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply [12:43:40] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh4003.wikimedia.org - jmm@cumin2002" [12:43:58] !log kamila@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply [12:44:05] !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-media: apply [12:44:18] !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply [12:44:27] !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply [12:44:41] !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply [12:44:49] 
!log kamila@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-media: apply [12:45:03] !log kamila@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply [12:45:11] !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [12:45:26] !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [12:45:43] !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply [12:46:04] !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [12:46:17] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [12:46:25] !log jiji@cumin1003 END (PASS) - Cookbook sre.memcached.roll-reboot-restart (exit_code=0) rolling reboot on A:memcached-codfw [12:46:45] jmm@cumin2002 makevm (PID 4054560) is awaiting input [12:46:57] !log kamila@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply [12:47:13] !log kamila@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [12:47:34] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1016.eqiad.wmnet [12:47:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:48:09] !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply [12:48:28] !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply [12:49:41] !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply [12:50:04] !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply [12:50:10] !log kamila@deploy2002 helmfile [codfw] START 
helmfile.d/services/shellbox-timeline: apply [12:50:24] (03CR) 10Jcrespo: [C:03+2] mediabackup: Prepare mediabackup worker profile for new storage backend [puppet] - 10https://gerrit.wikimedia.org/r/1254906 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo) [12:50:32] !log kamila@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply [12:50:47] !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-video: apply [12:51:17] !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply [12:51:42] !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [12:52:00] jmm@cumin2002 decommission (PID 4055145) is awaiting input [12:52:39] !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [12:52:54] !log kamila@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-video: apply [12:53:27] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1016.eqiad.wmnet [12:53:58] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1017.eqiad.wmnet [12:54:40] !log kamila@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply [12:57:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh4003.wikimedia.org - jmm@cumin2002" [12:57:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:57:53] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache doh4003.wikimedia.org on all recursors [12:57:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh4003.wikimedia.org on all recursors [12:58:32] (03PS1) 10Cathal Mooney: Nokia: fix bug in how DHCP relay config was generated [homer/public] - 10https://gerrit.wikimedia.org/r/1255722 
(https://phabricator.wikimedia.org/T371088) [12:59:22] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: testvm7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [12:59:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: testvm7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [12:59:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:59:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts testvm7001.magru.wmnet [12:59:53] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1017.eqiad.wmnet [13:00:00] (03CR) 10CI reject: [V:04-1] Nokia: fix bug in how DHCP relay config was generated [homer/public] - 10https://gerrit.wikimedia.org/r/1255722 (https://phabricator.wikimedia.org/T371088) (owner: 10Cathal Mooney) [13:00:03] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor I � Unicode. All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260319T1300). [13:00:05] MichaelG_WMF: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:06] (03PS1) 10Jcrespo: mediabackup: Skip references to Debian package versitygw until available [puppet] - 10https://gerrit.wikimedia.org/r/1255723 (https://phabricator.wikimedia.org/T410028) [13:00:26] Hey 👋 [13:00:41] MichaelG_WMF: i can deploy today [13:00:46] (03PS2) 10Jcrespo: mediabackup: Skip references to Debian package versitygw until available [puppet] - 10https://gerrit.wikimedia.org/r/1255723 (https://phabricator.wikimedia.org/T410028) [13:00:50] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1255723 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo) [13:00:57] (03CR) 10Tchanders: [C:03+1] mw::maintenance: Purge blocks on closed but not preinstall wikis [puppet] - 10https://gerrit.wikimedia.org/r/1255687 (https://phabricator.wikimedia.org/T420571) (owner: 10Dreamy Jazz) [13:00:58] Thanks urbanecm :) [13:01:05] (03CR) 10STran: [C:03+1] mw::maintenance: Purge blocks on closed but not preinstall wikis [puppet] - 10https://gerrit.wikimedia.org/r/1255687 (https://phabricator.wikimedia.org/T420571) (owner: 10Dreamy Jazz) [13:01:17] MichaelG_WMF: any objections if i deploy both patches at the same time? 
[13:01:31] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1255723 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo) [13:01:36] nope, that makes sense [13:01:50] (03CR) 10Urbanecm: [C:03+2] CreateAccount: Add class to aide in instrumentation [extensions/GrowthExperiments] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1255686 (owner: 10Michael Große) [13:01:51] (03CR) 10Urbanecm: [C:03+2] createAccount: Log exposure and CTRs for account creation experiment [extensions/WikimediaEvents] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1255685 (https://phabricator.wikimedia.org/T419916) (owner: 10Michael Große) [13:01:52] !log installing rsync security updates [13:01:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:04:12] (03Merged) 10jenkins-bot: CreateAccount: Add class to aide in instrumentation [extensions/GrowthExperiments] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1255686 (owner: 10Michael Große) [13:04:13] (03PS2) 10Cathal Mooney: Nokia: fix bug in how DHCP relay config was generated [homer/public] - 10https://gerrit.wikimedia.org/r/1255722 (https://phabricator.wikimedia.org/T371088) [13:04:13] (03CR) 10CI reject: [V:04-1] createAccount: Log exposure and CTRs for account creation experiment [extensions/WikimediaEvents] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1255685 (https://phabricator.wikimedia.org/T419916) (owner: 10Michael Große) [13:04:16] MichaelG_WMF: CI dislikes the WikimediaEvents patch :/ [13:04:39] 14:03:32 stderr: 'fatal: unable to access 'https://gerrit.wikimedia.org/r/mediawiki/vendor/': GnuTLS recv error (-54): Error in the pull function.' 
[13:04:43] seems unrelated... [13:04:45] (03CR) 10Jcrespo: [C:03+2] mediabackup: Skip references to Debian package versitygw until available [puppet] - 10https://gerrit.wikimedia.org/r/1255723 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo) [13:04:51] (03CR) 10Urbanecm: [C:03+2] "..." [extensions/WikimediaEvents] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1255685 (https://phabricator.wikimedia.org/T419916) (owner: 10Michael Große) [13:05:02] (03PS3) 10Clément Goubert: rest-gateway: Add linkrecommendation support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255581 (https://phabricator.wikimedia.org/T418148) [13:05:02] (03PS1) 10Clément Goubert: rest-gateway: Add core API support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255725 (https://phabricator.wikimedia.org/T418146) [13:05:04] yeah, that would be strange. it was fine in test just moments ago [13:05:36] rerunning [13:05:36] (03PS3) 10Cathal Mooney: Nokia: fix bug in how DHCP relay config was generated [homer/public] - 10https://gerrit.wikimedia.org/r/1255722 (https://phabricator.wikimedia.org/T371088) [13:07:14] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 17 hosts with reason: upgrade [13:07:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1255685 (https://phabricator.wikimedia.org/T419916) (owner: 10Michael Große) [13:08:07] 06SRE, 10SRE-Access-Requests: Requesting access to for - https://phabricator.wikimedia.org/T420578 (10CorinnaHillebrand_WMDE) 03NEW [13:08:25] (03CR) 10Muehlenhoff: [C:03+2] proton: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254867 (owner: 10Muehlenhoff) [13:09:13] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh4003.wikimedia.org - 
jmm@cumin2002" [13:09:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh4003.wikimedia.org - jmm@cumin2002" [13:09:22] (03Merged) 10jenkins-bot: createAccount: Log exposure and CTRs for account creation experiment [extensions/WikimediaEvents] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1255685 (https://phabricator.wikimedia.org/T419916) (owner: 10Michael Große) [13:09:37] 06SRE, 10SRE-Access-Requests: Requesting access to for - https://phabricator.wikimedia.org/T420578#11727365 (10CorinnaHillebrand_WMDE) @Hany.elmokadem as soon as you're back, could you give me your approval as my manager here? [13:09:50] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1255686|CreateAccount: Add class to aide in instrumentation]], [[gerrit:1255685|createAccount: Log exposure and CTRs for account creation experiment (T419916)]] [13:09:53] T419916: [V1 experiment release] Redesign mobile web account creation form following Codex guidelines - https://phabricator.wikimedia.org/T419916 [13:12:20] jmm@cumin2002 makevm (PID 4054560) is awaiting input [13:12:27] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host doh4003.wikimedia.org with OS bookworm [13:13:24] (03CR) 10Muehlenhoff: [C:03+2] Remove testvm7001 [puppet] - 10https://gerrit.wikimedia.org/r/1255718 (https://phabricator.wikimedia.org/T396864) (owner: 10Muehlenhoff) [13:13:41] !log urbanecm@deploy2002 migr, urbanecm: Backport for [[gerrit:1255686|CreateAccount: Add class to aide in instrumentation]], [[gerrit:1255685|createAccount: Log exposure and CTRs for account creation experiment (T419916)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:14:02] * MichaelG_WMF is looking [13:14:20] ty [13:15:22] urbanecm: looks good 👍 [13:15:40] !log urbanecm@deploy2002 migr, urbanecm: Continuing with sync [13:15:44] proceeding [13:22:14] !log upgrade rpki1001 to Routinator 0.15.1 T420572 [13:22:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:19] T420572: Upgrade Routinator to 0.15.1 - https://phabricator.wikimedia.org/T420572 [13:22:44] (03CR) 10Ssingh: [V:03+1 C:03+2] P:dns::auth: update check for authdns_update_run [puppet] - 10https://gerrit.wikimedia.org/r/1255038 (owner: 10Ssingh) [13:22:48] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1255686|CreateAccount: Add class to aide in instrumentation]], [[gerrit:1255685|createAccount: Log exposure and CTRs for account creation experiment (T419916)]] (duration: 12m 58s) [13:22:52] T419916: [V1 experiment release] Redesign mobile web account creation form following Codex guidelines - https://phabricator.wikimedia.org/T419916 [13:23:13] MichaelG_WMF: done [13:23:34] urbanecm: Thank you! 
[13:28:41] FIRING: JobUnavailable: Reduced availability for job routinator in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:29:30] (03PS3) 10Ssingh: Remove support for enabling Bird 2.18 selectively [puppet] - 10https://gerrit.wikimedia.org/r/1248385 (https://phabricator.wikimedia.org/T413740) (owner: 10Muehlenhoff) [13:31:40] If there's room in the window, I've got a backport to do [13:31:44] Just waiting on CI [13:32:32] phuedx: go ahead [13:32:49] Ta [13:33:28] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on doh4003.wikimedia.org with reason: host reimage [13:37:27] (03CR) 10Eevans: [C:03+2] cassandra-http-gateway: new chart based on aqs-http-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250649 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans) [13:37:29] (03PS4) 10Jforrester: Expose new wikifunctions.v0 REST API module on Wikifunctions.org only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250107 (https://phabricator.wikimedia.org/T419053) [13:38:04] phuedx: When you're done, please shout. [13:38:33] James_F: Go for it. I'm fighting with CI at this point [13:38:37] Ack. 
[13:38:41] RESOLVED: JobUnavailable: Reduced availability for job routinator in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:38:46] (03Abandoned) 10Jcrespo: Revert^4 "garage: Add a first role and profile" [puppet] - 10https://gerrit.wikimedia.org/r/1212080 (owner: 10Jcrespo) [13:38:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh4003.wikimedia.org with reason: host reimage [13:39:05] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250107 (https://phabricator.wikimedia.org/T419053) (owner: 10Jforrester) [13:39:19] (03Merged) 10jenkins-bot: cassandra-http-gateway: new chart based on aqs-http-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250649 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans) [13:39:27] (03PS2) 10Clément Goubert: rest-gateway: Add core API support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255725 (https://phabricator.wikimedia.org/T418146) [13:39:55] (03PS1) 10Daniel Kinzler: api-gateway: add Lua hooks mechanism for rest_gateway_routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255731 [13:39:55] (03CR) 10Eevans: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250650 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans) [13:40:06] (03CR) 10CI reject: [V:04-1] charts/cassandra-http-gateway: template table configuration for hoarde [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250650 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans) [13:40:20] (03Merged) 10jenkins-bot: Expose new wikifunctions.v0 REST API module on Wikifunctions.org only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250107 (https://phabricator.wikimedia.org/T419053) 
(owner: 10Jforrester) [13:40:40] !log jforrester@deploy2002 Started scap sync-world: Backport for [[gerrit:1250107|Expose new wikifunctions.v0 REST API module on Wikifunctions.org only (T419053)]] [13:40:45] T419053: Add REST module for Wikifunctions - https://phabricator.wikimedia.org/T419053 [13:40:51] (03PS1) 10Jcrespo: mediabackup: Switch backup media worker hosts to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1255732 (https://phabricator.wikimedia.org/T410028) [13:40:55] (03PS6) 10Eevans: charts/cassandra-http-gateway: template table configuration for hoarde [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250650 (https://phabricator.wikimedia.org/T414112) [13:40:55] (03PS7) 10Eevans: services: add linked-artifacts service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250651 (https://phabricator.wikimedia.org/T414112) [13:41:29] (03PS2) 10Jcrespo: mediabackup: Switch backup media worker hosts to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1255732 (https://phabricator.wikimedia.org/T410028) [13:41:32] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1255732 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo) [13:42:32] !log jforrester@deploy2002 jforrester: Backport for [[gerrit:1250107|Expose new wikifunctions.v0 REST API module on Wikifunctions.org only (T419053)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:42:51] !log jforrester@deploy2002 jforrester: Continuing with sync [13:43:18] 06SRE, 10SRE-Access-Requests: Requesting access to SQL Lab for cohi - https://phabricator.wikimedia.org/T420578#11727544 (10CorinnaHillebrand_WMDE) [13:46:36] phuedx: Over to you; good luck with CI wrestling. 
[13:46:43] !log jforrester@deploy2002 Finished scap sync-world: Backport for [[gerrit:1250107|Expose new wikifunctions.v0 REST API module on Wikifunctions.org only (T419053)]] (duration: 06m 03s) [13:46:47] T419053: Add REST module for Wikifunctions - https://phabricator.wikimedia.org/T419053 [13:46:49] Many thanks <3 [13:47:13] (03CR) 10Jcrespo: [C:03+2] mediabackup: Switch backup media worker hosts to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1255732 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo) [13:47:46] (03CR) 10Fabfur: [C:03+2] haproxy: test haproxy32 on cp2041 [puppet] - 10https://gerrit.wikimedia.org/r/1254195 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur) [13:48:21] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [13:48:58] (03PS4) 10Cathal Mooney: Nokia: fix bug in how DHCP relay config was generated [homer/public] - 10https://gerrit.wikimedia.org/r/1255722 (https://phabricator.wikimedia.org/T371088) [13:49:22] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [13:49:22] (03PS1) 10Jforrester: Move testwiki-only Attribution REST API definition to IS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250113 [13:49:22] (03CR) 10Jforrester: "Hey @aschulz@wikimedia.org, I used this nicer style for wiki-specific config for the Wikifunctions API config and it works well. 
I've made" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250113 (owner: 10Jforrester) [13:49:27] (03PS2) 10Jforrester: Move testwiki-only Attribution REST API definition to IS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250113 [13:50:32] (03PS1) 10Jforrester: Move GrowthExperiments REST API definition to IS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250114 [13:50:55] 06SRE, 06Infrastructure-Foundations: Upgrade Routinator to 0.15.1 - https://phabricator.wikimedia.org/T420572#11727581 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff All RPKI host are upgraded and Cathal confirmed it's all working fine [13:52:39] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [13:52:42] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [13:52:56] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [13:52:58] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [13:53:08] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. 
[13:53:12] (03PS1) 10Fabfur: haproxy: fix lua lib version with haproxy 3.2 [puppet] - 10https://gerrit.wikimedia.org/r/1255735 (https://phabricator.wikimedia.org/T419825) [13:54:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh4003.wikimedia.org with OS bookworm [13:54:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh4003.wikimedia.org [13:55:32] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1255735 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur) [13:58:11] (03CR) 10Ssingh: [C:03+1] haproxy: fix lua lib version with haproxy 3.2 [puppet] - 10https://gerrit.wikimedia.org/r/1255735 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur) [13:58:30] (03CR) 10Fabfur: [C:03+2] haproxy: fix lua lib version with haproxy 3.2 [puppet] - 10https://gerrit.wikimedia.org/r/1255735 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur) [13:58:33] jouncebot next [13:58:33] In 0 hour(s) and 31 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260319T1430) [13:59:50] (03PS1) 10Kosta Harlan: hcaptcha: Use the global edit key for MobileFrontend edits if present [extensions/ConfirmEdit] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1255736 (https://phabricator.wikimedia.org/T420574) [14:00:26] (03CR) 10Ayounsi: [C:03+1] Nokia: fix bug in how DHCP relay config was generated [homer/public] - 10https://gerrit.wikimedia.org/r/1255722 (https://phabricator.wikimedia.org/T371088) (owner: 10Cathal Mooney) [14:01:43] (03CR) 10Elukey: [C:03+1] aux-k8s/kafka-mirrormaker: add main-codfw-to-main-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255660 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol) [14:02:33] (03CR) 10Elukey: [C:03+1] kafka-main-eqiad: disable mirroring to kafka-main-codfw [puppet] - 10https://gerrit.wikimedia.org/r/1255657 
(https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol) [14:02:49] (03CR) 10Elukey: [C:03+1] kafka-main-codfw: disable mirroring to kafka-main-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1255656 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol) [14:02:59] (03CR) 10Elukey: [C:03+1] kafka-jumbo-eqiad: disable mirroring from kafka-main-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1255658 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol) [14:03:12] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [14:04:57] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host doh4004.wikimedia.org [14:04:59] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [14:05:45] 06SRE, 06Traffic: Anycast ns[01].wikimedia.org for IPv4 - https://phabricator.wikimedia.org/T366193#11727707 (10cmooney) >>! In T366193#11713908, @ssingh wrote: >>> I think we should clean up stuff in the interim though since it will be a while before we can get our hands on the /24. I will need your help with... [14:06:33] (03PS1) 10Muehlenhoff: Make doh4003/doh4004 new wikidough nodes [puppet] - 10https://gerrit.wikimedia.org/r/1255738 (https://phabricator.wikimedia.org/T418993) [14:09:31] FIRING: ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:10:04] ^Yeah FWIW gerrit seems to be struggling [14:11:26] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh4004.wikimedia.org - jmm@cumin2002" [14:12:28] !log jmm@deploy2002 helmfile [staging] START helmfile.d/services/proton: apply [14:13:14] Daimona: do you have issues with ip4? 
the issue seem to only be with ipv6 [14:13:40] It seems to be working normally now [14:13:48] !log jmm@deploy2002 helmfile [staging] DONE helmfile.d/services/proton: apply [14:13:54] I had like a massive slowdown for ~5 minutes, haven't checked anything tho [14:14:02] ack [14:14:30] jmm@cumin2002 makevm (PID 4074555) is awaiting input [14:14:31] RESOLVED: ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:17:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh4004.wikimedia.org - jmm@cumin2002" [14:17:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:17:26] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache doh4004.wikimedia.org on all recursors [14:17:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh4004.wikimedia.org on all recursors [14:17:51] (03PS1) 10BBlack: Fix Wmf-Uniq Server-Timing header format [puppet] - 10https://gerrit.wikimedia.org/r/1255744 (https://phabricator.wikimedia.org/T420586) [14:17:59] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh4004.wikimedia.org - jmm@cumin2002" [14:18:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh4004.wikimedia.org - jmm@cumin2002" [14:18:07] PROBLEM - Ensure traffic_server is running for instance backend on cp4043 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/bin/traffic_server -M --httpport 3128 
https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:18:16] !log jmm@deploy2002 helmfile [codfw] START helmfile.d/services/proton: apply [14:18:24] (03PS2) 10Brouberol: aux-k8s/kafka-mirrormaker: add main-eqiad-to-main-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255659 (https://phabricator.wikimedia.org/T417407) [14:18:24] (03PS3) 10Brouberol: aux-k8s/kafka-mirrormaker: add main-codfw-to-main-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255660 (https://phabricator.wikimedia.org/T417407) [14:18:25] (03PS3) 10Brouberol: aux-k8s/kafka-mirrormaker: add main-eqad-to-jumbo-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255661 (https://phabricator.wikimedia.org/T417407) [14:18:25] (03PS3) 10Brouberol: aux-k8s/kafka-mirrormaker: cleanup helmfile of duplicated namespace definitions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255662 (https://phabricator.wikimedia.org/T417407) [14:19:07] RECOVERY - Ensure traffic_server is running for instance backend on cp4043 is OK: PROCS OK: 1 process with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:19:25] !log jmm@deploy2002 helmfile [codfw] DONE helmfile.d/services/proton: apply [14:20:10] !log jmm@deploy2002 helmfile [eqiad] START helmfile.d/services/proton: apply [14:20:34] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host doh4004.wikimedia.org with OS bookworm [14:21:29] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [14:21:37] !log jmm@deploy2002 helmfile [eqiad] DONE helmfile.d/services/proton: apply [14:22:02] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-conf1004.eqiad.wmnet [14:22:49] (03PS1) 10Fabfur: profile::haproxy: ability to use custom component on Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1255745 (https://phabricator.wikimedia.org/T419825) [14:23:11] (03CR) 10Ayounsi: [C:03+2] Anycast: prepend once more 
when peering with the core routers [homer/public] - 10https://gerrit.wikimedia.org/r/1254185 (https://phabricator.wikimedia.org/T420342) (owner: 10Ayounsi) [14:24:57] (03PS4) 10Brouberol: aux-k8s/kafka-mirrormaker: add main-eqad-to-jumbo-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255661 (https://phabricator.wikimedia.org/T417407) [14:24:57] (03PS4) 10Brouberol: aux-k8s/kafka-mirrormaker: cleanup helmfile of duplicated namespace definitions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255662 (https://phabricator.wikimedia.org/T417407) [14:25:17] !log bking@cumin2002 conftool action : set/pooled=yes:weight=10; selector: name=dse-k8s-worker1012.eqiad.wmnet|dse-k8s-worker1015.eqiad.wmnet|dse-k8s-worker1016.eqiad.wmnet|dse-k8s-worker1017.eqiad.wmnet [14:26:10] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1255745 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur) [14:27:25] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-conf1004.eqiad.wmnet [14:29:17] (03PS8) 10Jcrespo: mediabackup: Deploy new ms-backup hosts on both dcs [puppet] - 10https://gerrit.wikimedia.org/r/1254913 (https://phabricator.wikimedia.org/T420464) [14:29:29] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1254913 (https://phabricator.wikimedia.org/T420464) (owner: 10Jcrespo) [14:29:57] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-conf1005.eqiad.wmnet [14:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260319T1430) [14:30:44] 06SRE-OnFire, 10Cite, 10VisualEditor, 10WMDE-TechWish-Maintenance, and 3 others: Investigation: Write visual editor debug tool to produce Converter test cases - https://phabricator.wikimedia.org/T400311#11727887 (10awight) 05Open→03Resolved a:03awight Great! We have some more fine-tuning to make... 
[14:31:43] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 227.20 ms [14:31:46] (03CR) 10Cathal Mooney: [C:03+2] Nokia: fix bug in how DHCP relay config was generated [homer/public] - 10https://gerrit.wikimedia.org/r/1255722 (https://phabricator.wikimedia.org/T371088) (owner: 10Cathal Mooney) [14:31:50] (03PS1) 10Phuedx: Hooks: Re-apply I52fc151ab88d79754baeff35d2c0f200ebe9fc9a [extensions/TestKitchen] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1255747 [14:31:55] PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:31:55] PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:32:13] why is that :P [14:32:23] !log bking@cumin2002 conftool action : set/pooled=yes:weight=1; selector: name=dse-k8s-worker1010.eqiad.wmnet|dse-k8s-worker1011.eqiad.wmnet|dse-k8s-worker1012.eqiad.wmnet|dse-k8s-worker1013.eqiad.wmnet|dse-k8s-worker1015.eqiad.wmnet|dse-k8s-worker1016.eqiad.wmnet|dse-k8s-worker1017.eqiad.wmnet|dse-k8s-worker1018.eqiad.wmnet|dse-k8s-worker1019.eqiad.wmnet [14:33:05] (03Merged) 10jenkins-bot: Nokia: fix bug in how DHCP relay config was generated [homer/public] - 10https://gerrit.wikimedia.org/r/1255722 (https://phabricator.wikimedia.org/T371088) (owner: 10Cathal Mooney) [14:34:29] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1255736 (https://phabricator.wikimedia.org/T420574) (owner: 10Kosta Harlan) [14:35:27] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-conf1005.eqiad.wmnet [14:38:07] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [14:38:36] !log 
jayme@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [14:40:00] (03PS1) 10Cathal Mooney: Nokia: Manually configure the MAC address for anycast gateway ints [homer/public] - 10https://gerrit.wikimedia.org/r/1255749 (https://phabricator.wikimedia.org/T371088) [14:40:30] !log jayme@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [14:41:10] (03CR) 10CI reject: [V:04-1] Nokia: Manually configure the MAC address for anycast gateway ints [homer/public] - 10https://gerrit.wikimedia.org/r/1255749 (https://phabricator.wikimedia.org/T371088) (owner: 10Cathal Mooney) [14:41:53] (03PS2) 10BBlack: Fix Wmf-Uniq Server-Timing header format [puppet] - 10https://gerrit.wikimedia.org/r/1255744 (https://phabricator.wikimedia.org/T420586) [14:42:35] jouncebot: nowandnext [14:42:36] For the next 0 hour(s) and 17 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260319T1430) [14:42:36] In 0 hour(s) and 17 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260319T1500) [14:43:10] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on doh4004.wikimedia.org with reason: host reimage [14:43:15] (03CR) 10BCornwall: [C:03+1] Fix Wmf-Uniq Server-Timing header format [puppet] - 10https://gerrit.wikimedia.org/r/1255744 (https://phabricator.wikimedia.org/T420586) (owner: 10BBlack) [14:44:27] 06SRE, 10SRE-Access-Requests: Requesting access to production for Mpostoronca-wmf - https://phabricator.wikimedia.org/T420458#11727944 (10OKryva-WMF) Approved. 
[14:46:06] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-conf1006.eqiad.wmnet [14:46:50] (03CR) 10BCornwall: [V:03+1 C:03+1] "`0 tests failed, 0 tests skipped, 40 tests passed" [puppet] - 10https://gerrit.wikimedia.org/r/1255744 (https://phabricator.wikimedia.org/T420586) (owner: 10BBlack) [14:48:21] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 265.74 ms [14:49:01] (03CR) 10BCornwall: "+1 in the sense that the code seems sound - no idea about the hostname accuracy." [dns] - 10https://gerrit.wikimedia.org/r/1255669 (https://phabricator.wikimedia.org/T416705) (owner: 10Gerrit maintenance bot) [14:49:10] (03CR) 10BCornwall: [C:03+1] wmnet: update CNAME records for DB masters for dc switchover [dns] - 10https://gerrit.wikimedia.org/r/1255669 (https://phabricator.wikimedia.org/T416705) (owner: 10Gerrit maintenance bot) [14:49:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh4004.wikimedia.org with reason: host reimage [14:51:21] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-conf1006.eqiad.wmnet [14:51:37] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host flink-zk1001.eqiad.wmnet [14:52:15] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host zookeeper-test1002.eqiad.wmnet [14:54:36] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host matomo1003.eqiad.wmnet [14:55:11] (03PS4) 10Fabfur: profile::haproxy: ability to use custom component on Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1255745 (https://phabricator.wikimedia.org/T419825) [14:55:25] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host flink-zk1001.eqiad.wmnet [14:55:51] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host flink-zk1002.eqiad.wmnet [14:56:14] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host zookeeper-test1002.eqiad.wmnet [14:57:53] (03PS1) 10Brouberol: kafka-mirrormaker: enable JMX metrics collection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255755 (https://phabricator.wikimedia.org/T417407) [14:58:35] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host matomo1003.eqiad.wmnet [14:59:37] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host flink-zk1002.eqiad.wmnet [15:00:01] (03CR) 10Phuedx: Fix Wmf-Uniq Server-Timing header format (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1255744 (https://phabricator.wikimedia.org/T420586) (owner: 10BBlack) [15:00:04] andre and brennen: #bothumor I � Unicode. All rise for Train log triage deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260319T1500). [15:01:15] (03CR) 10BCornwall: "Could this all be condensed into something like `unless debian::codename::eq('trixie') and $haproxy_version == 'haproxy30'`?" [puppet] - 10https://gerrit.wikimedia.org/r/1255745 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur) [15:01:32] (03CR) 10BCornwall: "Marking unresolved" [puppet] - 10https://gerrit.wikimedia.org/r/1255745 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur) [15:01:55] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host flink-zk1003.eqiad.wmnet [15:02:13] 10ops-magru: Inbound errors on interface cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}) - https://phabricator.wikimedia.org/T415743#11728021 (10RobH) Please note the ticket was opened but their portal doesn't seem to email myself, Arzhel, or Cathal even though I listed all three of us on the tic... 
[15:03:45] FIRING: [2x] CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [15:05:31] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host flink-zk1003.eqiad.wmnet [15:06:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh4004.wikimedia.org with OS bookworm [15:06:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh4004.wikimedia.org [15:06:53] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5025.eqsin.wmnet with OS trixie [15:07:04] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5026.eqsin.wmnet with OS trixie [15:07:30] (03CR) 10Phuedx: Fix Wmf-Uniq Server-Timing header format (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1255744 (https://phabricator.wikimedia.org/T420586) (owner: 10BBlack) [15:08:00] I'm looking to deploy https://gerrit.wikimedia.org/r/c/mediawiki/extensions/TestKitchen/+/1255747. It should fix the large volume of validation errors on the mediawiki.api-request event stream [15:08:11] (03CR) 10Elukey: [C:03+1] kafka-mirrormaker: enable JMX metrics collection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255755 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol) [15:08:41] andre, brennen: Would an out of band deployment disrupt you? [15:08:44] 10ops-magru: Inbound errors on interface cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}) - https://phabricator.wikimedia.org/T415743#11728044 (10RobH) Summary: * EdgeUno says they see no errors only our flap * Arzhel replied back stating that we are still seeing errors, stressed that we've alread... 
[15:08:45] FIRING: [3x] CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [15:09:02] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [15:09:06] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [15:09:07] phuedx: train has reached its final destination and things are calm [15:09:30] in general, see https://versions.toolforge.org/ for a quick versions check :) [15:09:42] !log jayme@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [15:10:17] !log jayme@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [15:10:36] andre: Thanks. OK. Starting [15:10:56] !log jayme@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [15:10:59] (03CR) 10TrainBranchBot: [C:03+2] "Approved by phuedx@deploy2002 using scap backport" [extensions/TestKitchen] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1255747 (owner: 10Phuedx) [15:11:01] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [15:12:30] (03CR) 10Dzahn: "does this apply to a service on the aux cluster though?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1255514 (https://phabricator.wikimedia.org/T414405) (owner: 10AOkoth) [15:12:35] (03Merged) 10jenkins-bot: Hooks: Re-apply I52fc151ab88d79754baeff35d2c0f200ebe9fc9a [extensions/TestKitchen] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1255747 (owner: 10Phuedx) [15:12:57] !log phuedx@deploy2002 Started scap sync-world: Backport for [[gerrit:1255747|Hooks: Re-apply I52fc151ab88d79754baeff35d2c0f200ebe9fc9a]] [15:13:45] RESOLVED: [3x] CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [15:14:16] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host hcaptcha-proxy4003.wikimedia.org [15:14:18] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [15:14:46] (03PS5) 10Fabfur: profile::haproxy: ability to use custom component on Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1255745 (https://phabricator.wikimedia.org/T419825) [15:14:48] !log phuedx@deploy2002 phuedx: Backport for [[gerrit:1255747|Hooks: Re-apply I52fc151ab88d79754baeff35d2c0f200ebe9fc9a]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:14:55] (03CR) 10Fabfur: "Probably yes, I was thinking about supporting also future versions but let's start easy and do this way instead!" [puppet] - 10https://gerrit.wikimedia.org/r/1255745 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur) [15:15:18] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [15:15:22] !log jayme@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [15:15:42] !log jayme@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [15:15:46] !log jayme@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. 
[15:16:42] !log jayme@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [15:16:46] !log jayme@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [15:17:02] (03PS2) 10Cathal Mooney: Nokia: Manually configure the MAC address for anycast gateway ints [homer/public] - 10https://gerrit.wikimedia.org/r/1255749 (https://phabricator.wikimedia.org/T371088) [15:17:37] (03PS3) 10Cathal Mooney: Nokia: Manually configure the MAC address for anycast gateway ints [homer/public] - 10https://gerrit.wikimedia.org/r/1255749 (https://phabricator.wikimedia.org/T371088) [15:17:39] !log jayme@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [15:17:43] !log jayme@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [15:18:40] !log jayme@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [15:18:40] Quick check on enwiki main page looks good and the logs look clean (no warnings or errors in Logstash) [15:18:44] !log jayme@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. 
[15:18:49] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1255745 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur) [15:18:58] !log phuedx@deploy2002 phuedx: Continuing with sync [15:19:38] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (NOOP 3 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1255745 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur) [15:19:52] (03CR) 10BCornwall: [C:03+1] profile::haproxy: ability to use custom component on Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1255745 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur) [15:19:53] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [15:19:57] jmm@cumin2002 makevm (PID 4091383) is awaiting input [15:21:44] !log jayme@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [15:21:48] !log jayme@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [15:21:49] (03CR) 10Ssingh: [C:03+1] Make doh4003/doh4004 new wikidough nodes [puppet] - 10https://gerrit.wikimedia.org/r/1255738 (https://phabricator.wikimedia.org/T418993) (owner: 10Muehlenhoff) [15:21:57] !log jayme@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [15:22:00] !log jayme@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. [15:22:17] !log jayme@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. 
[15:22:48] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy4003.wikimedia.org - jmm@cumin2002" [15:22:52] !log phuedx@deploy2002 Finished scap sync-world: Backport for [[gerrit:1255747|Hooks: Re-apply I52fc151ab88d79754baeff35d2c0f200ebe9fc9a]] (duration: 09m 55s) [15:24:55] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 222.72 ms [15:25:10] (03CR) 10Jcrespo: [C:03+2] mediabackup: Deploy new ms-backup hosts on both dcs [puppet] - 10https://gerrit.wikimedia.org/r/1254913 (https://phabricator.wikimedia.org/T420464) (owner: 10Jcrespo) [15:25:24] Monitoring logs [15:25:49] !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host acmechief-test1001.eqiad.wmnet [15:25:54] jmm@cumin2002 makevm (PID 4091383) is awaiting input [15:26:17] !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host acmechief-test2001.codfw.wmnet [15:28:56] !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host acmechief1002.eqiad.wmnet [15:29:28] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host acmechief-test1001.eqiad.wmnet [15:30:01] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:30:07] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host acmechief-test2001.codfw.wmnet [15:31:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy4003.wikimedia.org - jmm@cumin2002" [15:31:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox 
(exit_code=0) [15:31:06] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache hcaptcha-proxy4003.wikimedia.org on all recursors [15:31:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) hcaptcha-proxy4003.wikimedia.org on all recursors [15:31:41] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy4003.wikimedia.org - jmm@cumin2002" [15:31:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy4003.wikimedia.org - jmm@cumin2002" [15:32:27] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy4003.wikimedia.org with OS bookworm [15:32:41] (03CR) 10Ssingh: "(Still on the list to review, not forgotten about this.)" [puppet] - 10https://gerrit.wikimedia.org/r/1250626 (owner: 10Majavah) [15:32:47] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host acmechief1002.eqiad.wmnet [15:32:54] !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host acmechief2002.codfw.wmnet [15:33:06] (03CR) 10BBlack: Fix Wmf-Uniq Server-Timing header format (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1255744 (https://phabricator.wikimedia.org/T420586) (owner: 10BBlack) [15:33:43] PROBLEM - MariaDB Replica IO: s3 on clouddb1013 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:33:43] PROBLEM - MariaDB Replica SQL: s3 on clouddb1013 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:34:29] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp5026.eqsin.wmnet with OS trixie [15:34:39] PROBLEM - MariaDB Replica Lag: s3 on 
clouddb1013 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 550.25 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:34:46] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp5025.eqsin.wmnet with OS trixie [15:34:52] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5026.eqsin.wmnet with OS trixie [15:35:06] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5025.eqsin.wmnet with OS trixie [15:35:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:35:43] (03PS4) 10Ssingh: Remove support for enabling Bird 2.18 selectively [puppet] - 10https://gerrit.wikimedia.org/r/1248385 (https://phabricator.wikimedia.org/T413740) (owner: 10Muehlenhoff) [15:35:57] Logs look clean and the validation errors have disappeared 👍 [15:36:16] (03CR) 10CI reject: [V:04-1] Remove support for enabling Bird 2.18 selectively [puppet] - 10https://gerrit.wikimedia.org/r/1248385 (https://phabricator.wikimedia.org/T413740) (owner: 10Muehlenhoff) [15:36:44] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host acmechief2002.codfw.wmnet [15:36:55] (03PS5) 10Ssingh: Remove support for enabling Bird 2.18 selectively [puppet] - 10https://gerrit.wikimedia.org/r/1248385 (https://phabricator.wikimedia.org/T413740) (owner: 10Muehlenhoff) [15:40:04] (03PS3) 10Dzahn: jenkins: allow rsyncing of data for migrating a jenkins server [puppet] - 10https://gerrit.wikimedia.org/r/1255136 (https://phabricator.wikimedia.org/T418521) [15:42:44] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 9 DIFF 1 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compile" [puppet] - 
10https://gerrit.wikimedia.org/r/1248385 (https://phabricator.wikimedia.org/T413740) (owner: 10Muehlenhoff) [15:43:44] (03CR) 10Fabfur: [C:03+2] profile::haproxy: ability to use custom component on Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1255745 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur) [15:43:45] 06SRE, 10SRE-Access-Requests: Requesting access to SQL Lab for cohi - https://phabricator.wikimedia.org/T420578#11728322 (10Aklapper) @CorinnaHillebrand_WMDE: Please also [link your LDAP account to your Phabricator account](https://phabricator.wikimedia.org/settings/panel/external/), so your 'LDAP User' accoun... [15:47:11] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1255136/8308/" [puppet] - 10https://gerrit.wikimedia.org/r/1255136 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [15:47:45] (03PS1) 10Milimetric: testKitchen: Add custom stream name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1255763 (https://phabricator.wikimedia.org/T417050) [15:48:09] (03CR) 10Phuedx: Fix Wmf-Uniq Server-Timing header format (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1255744 (https://phabricator.wikimedia.org/T420586) (owner: 10BBlack) [15:48:18] PROBLEM - Host ms-backup2003 is DOWN: PING CRITICAL - Packet loss = 100% [15:48:18] PROBLEM - Host ms-backup2004 is DOWN: PING CRITICAL - Packet loss = 100% [15:48:50] RECOVERY - Host ms-backup2003 is UP: PING OK - Packet loss = 0%, RTA = 30.46 ms [15:49:08] (03CR) 10Ssingh: [V:03+1] "@mmuhlenhoff@wikimedia.org: I rebased this on master and wanted to merge this today. Can you please quickly review it again? Thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/1248385 (https://phabricator.wikimedia.org/T413740) (owner: 10Muehlenhoff) [15:49:22] RECOVERY - Host ms-backup2004 is UP: PING OK - Packet loss = 0%, RTA = 30.41 ms [15:51:21] (03CR) 10TChin: [C:03+1] testKitchen: Add custom stream name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1255763 (https://phabricator.wikimedia.org/T417050) (owner: 10Milimetric) [15:51:33] (03PS1) 10Dzahn: Revert "jenkins: define contint1003 as the manager_host for the jenkins role" [puppet] - 10https://gerrit.wikimedia.org/r/1255764 [15:52:16] (03CR) 10Phuedx: [C:03+1] testKitchen: Add custom stream name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1255763 (https://phabricator.wikimedia.org/T417050) (owner: 10Milimetric) [15:53:10] (03PS1) 10Jdlrobson: Implement addListener fallback for older browsers in matchMedia [extensions/WP25EasterEggs] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1255765 (https://phabricator.wikimedia.org/T419717) [15:53:22] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha-proxy4003.wikimedia.org with reason: host reimage [15:54:03] (03CR) 10Ssingh: [V:03+1] "By quickly I mean not urgently but that it is a quick review 😊" [puppet] - 10https://gerrit.wikimedia.org/r/1248385 (https://phabricator.wikimedia.org/T413740) (owner: 10Muehlenhoff) [15:56:30] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [15:57:56] (03CR) 10Dzahn: [C:03+2] Revert "jenkins: define contint1003 as the manager_host for the jenkins role" [puppet] - 10https://gerrit.wikimedia.org/r/1255764 (owner: 10Dzahn) [15:59:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha-proxy4003.wikimedia.org with reason: host reimage [15:59:29] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1248385 (https://phabricator.wikimedia.org/T413740) (owner: 10Muehlenhoff) [16:00:05] jhathaway and rzl: How many deployers does it take to do Puppet request window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260319T1600). [16:00:05] phuedx and Dreamy_Jazz: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:08] (03CR) 10Muehlenhoff: [C:03+2] Make doh4003/doh4004 new wikidough nodes [puppet] - 10https://gerrit.wikimedia.org/r/1255738 (https://phabricator.wikimedia.org/T418993) (owner: 10Muehlenhoff) [16:00:12] \o [16:00:17] o/ [16:01:32] o/ hi, looking [16:02:30] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host flink-zk2001.codfw.wmnet [16:03:12] (03PS1) 10Brouberol: kafka-mirrormaker: update base image to include prometheus-jmx-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255767 (https://phabricator.wikimedia.org/T417407) [16:05:01] !log brouberol@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [16:05:11] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5025.eqsin.wmnet with reason: host reimage [16:05:16] (03CR) 10RLazarus: [C:03+2] mw::maintenance: Remove ExperimentationLab periodic job [puppet] - 10https://gerrit.wikimedia.org/r/1249932 (https://phabricator.wikimedia.org/T419428) (owner: 10Phuedx) [16:05:56] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5026.eqsin.wmnet with reason: host reimage [16:06:30] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host flink-zk2001.codfw.wmnet [16:06:31] !log brouberol@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[16:06:45] Dreamy_Jazz: for https://gerrit.wikimedia.org/r/1255694, are you able to get a review from someone familiar with the subject matter? [16:06:59] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1142.eqiad.wmnet [16:07:08] !log brouberol@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [16:07:15] I can ask someone from my team to give a +1 if needed [16:07:21] I can review that this will indeed do *something* every day at midnight, but not that it'll do the right thing :) [16:07:34] yeah, that'd be appreciated -- if it takes longer than the puppet window just ping me, happy to still do it [16:08:16] !log brouberol@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [16:08:29] Pinged my team [16:08:30] assume you'd like me to go ahead with https://gerrit.wikimedia.org/r/1255687 in the meantime though? [16:08:37] Yeah, these are independent changes [16:08:41] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:08:41] 👍 [16:08:49] (03CR) 10RLazarus: [C:03+2] mw::maintenance: Purge blocks on closed but not preinstall wikis [puppet] - 10https://gerrit.wikimedia.org/r/1255687 (https://phabricator.wikimedia.org/T420571) (owner: 10Dreamy Jazz) [16:09:18] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5025.eqsin.wmnet with reason: host reimage [16:09:23] (03CR) 10Kosta Harlan: [C:03+1] mw::maintenance: Run purgeRecentChanges.php on wikis without CheckUser [puppet] - 10https://gerrit.wikimedia.org/r/1255694 (https://phabricator.wikimedia.org/T420062) (owner: 10Dreamy Jazz) [16:09:50] that was quick, thanks :) will go ahead [16:09:54] :D [16:10:03] That's what pings on Slack get you :D [16:10:10] would that it were always so [16:10:22] (03CR) 
10RLazarus: [C:03+2] mw::maintenance: Run purgeRecentChanges.php on wikis without CheckUser [puppet] - 10https://gerrit.wikimedia.org/r/1255694 (https://phabricator.wikimedia.org/T420062) (owner: 10Dreamy Jazz) [16:10:44] merging these all at once, then we can wait for puppet on the deploy host only a single time [16:10:45] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host flink-zk2002.codfw.wmnet [16:10:56] will you want me to manually kick off a test run, or just wait until they fire naturally? [16:11:10] For mine, can wait till they fire naturally [16:11:19] cool, phuedx's is a no-op [16:11:28] (03CR) 10Brouberol: [C:03+2] kafka-mirrormaker: enable JMX metrics collection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255755 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol) [16:11:32] (03CR) 10Brouberol: [C:03+2] kafka-mirrormaker: update base image to include prometheus-jmx-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255767 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol) [16:11:36] (meant to say, thanks phuedx for the cleanup <3 easy to forget those) [16:11:45] No worries! [16:12:06] I would suggest to remove abstractwiki from the dblists until the addWiki bug is fixed. Our infrastructure does not really support wikis in preinstall without a db as seen. 
[16:13:40] (03Merged) 10jenkins-bot: kafka-mirrormaker: enable JMX metrics collection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255755 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol) [16:13:41] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5026.eqsin.wmnet with reason: host reimage [16:13:42] (03Merged) 10jenkins-bot: kafka-mirrormaker: update base image to include prometheus-jmx-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255767 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol) [16:14:12] (03PS1) 10Jsn.sherman: Remove local configuration routing and loading [extensions/AutoModerator] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1255772 (https://phabricator.wikimedia.org/T419835) [16:14:18] (03PS1) 10Jforrester: [abstractwiki] Temporarily disable wgWikiLambdaEnableAbstractMode to see if this means we can create the wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1255773 (https://phabricator.wikimedia.org/T420531) [16:14:26] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/AutoModerator] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1255772 (https://phabricator.wikimedia.org/T419835) (owner: 10Jsn.sherman) [16:14:38] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host flink-zk2002.codfw.wmnet [16:14:52] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host flink-zk2003.codfw.wmnet [16:15:17] As soon as rzl's puppetting is over I'll try another fix for AW. 
[16:15:24] (03PS3) 10BBlack: Fix Wmf-Uniq Server-Timing header format [puppet] - 10https://gerrit.wikimedia.org/r/1255744 (https://phabricator.wikimedia.org/T420586) [16:15:38] James_F: go ahead, I'm still finishing up but nothing that'll conflict [16:15:46] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1142.eqiad.wmnet [16:16:05] Ack. [16:16:10] (and see also zabe's comment above, for awareness) [16:16:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha-proxy4003.wikimedia.org with OS bookworm [16:16:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host hcaptcha-proxy4003.wikimedia.org [16:16:20] (03CR) 10BBlack: Fix Wmf-Uniq Server-Timing header format (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1255744 (https://phabricator.wikimedia.org/T420586) (owner: 10BBlack) [16:16:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1255773 (https://phabricator.wikimedia.org/T420531) (owner: 10Jforrester) [16:17:03] (03CR) 10BBlack: [C:03+2] Fix Wmf-Uniq Server-Timing header format [puppet] - 10https://gerrit.wikimedia.org/r/1255744 (https://phabricator.wikimedia.org/T420586) (owner: 10BBlack) [16:17:17] !log jmm@cumin2002 START - Cookbook sre.netbox.restart-reboot rolling reboot on A:netbox [16:17:21] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache netbox.discovery.wmnet. on all recursors [16:17:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netbox.discovery.wmnet. 
on all recursors [16:17:33] (03Merged) 10jenkins-bot: [abstractwiki] Temporarily disable wgWikiLambdaEnableAbstractMode to see if this means we can create the wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1255773 (https://phabricator.wikimedia.org/T420531) (owner: 10Jforrester) [16:17:51] (03PS2) 10Dzahn: ci::jenkins: add firewall rule to allow legacy machines to new jenkins [puppet] - 10https://gerrit.wikimedia.org/r/1255144 (https://phabricator.wikimedia.org/T418521) [16:17:54] !log jforrester@deploy2002 Started scap sync-world: Backport for [[gerrit:1255773|[abstractwiki] Temporarily disable wgWikiLambdaEnableAbstractMode to see if this means we can create the wiki (T420531)]] [16:17:59] T420531: addWiki.php fails with CannotReplaceActiveServiceException for DBLoadBalancerFactory - https://phabricator.wikimedia.org/T420531 [16:18:53] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host flink-zk2003.codfw.wmnet [16:19:47] !log jforrester@deploy2002 jforrester: Backport for [[gerrit:1255773|[abstractwiki] Temporarily disable wgWikiLambdaEnableAbstractMode to see if this means we can create the wiki (T420531)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[16:20:06] !log jforrester@deploy2002 jforrester: Continuing with sync [16:20:13] !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp2041*} and A:cp - 3.2 test upgrade () [16:20:13] !log fabfur@cumin1003 END (FAIL) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=1) rolling upgrade of HAProxy on P{cp2041*} and A:cp - 3.2 test upgrade () [16:20:27] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host hcaptcha-proxy4004.wikimedia.org [16:20:29] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [16:20:55] (03CR) 10Muehlenhoff: Add fundraising-data-uploader role user (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1255036 (https://phabricator.wikimedia.org/T416948) (owner: 10CDanis) [16:21:36] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache netbox.discovery.wmnet. on all recursors [16:21:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netbox.discovery.wmnet. on all recursors [16:21:58] !log jmm@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [16:22:02] (03CR) 10Eevans: [C:03+2] charts/cassandra-http-gateway: template table configuration for hoarde [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250650 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans) [16:22:19] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 378.75 ms [16:23:39] !log eevans@cumin1003 START - Cookbook sre.cassandra.roll-reboot rolling reboot on A:aqs-codfw [16:23:41] (03PS1) 10DCausse: airflow-search: add secrets for opensearch-semantic-search clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255778 (https://phabricator.wikimedia.org/T414091) [16:23:52] (03Merged) 10jenkins-bot: charts/cassandra-http-gateway: template table configuration for hoarde [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250650 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans) [16:23:59] (03PS1) 10Jforrester: Revert 
"[abstractwiki] Temporarily disable wgWikiLambdaEnableAbstractMode to see if this means we can create the wiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1255779 [16:24:00] !log jforrester@deploy2002 Finished scap sync-world: Backport for [[gerrit:1255773|[abstractwiki] Temporarily disable wgWikiLambdaEnableAbstractMode to see if this means we can create the wiki (T420531)]] (duration: 06m 06s) [16:24:04] T420531: addWiki.php fails with CannotReplaceActiveServiceException for DBLoadBalancerFactory - https://phabricator.wikimedia.org/T420531 [16:24:37] Dreamy_Jazz: you're all set [16:24:38] (03CR) 10Jforrester: [C:03+2] Revert "[abstractwiki] Temporarily disable wgWikiLambdaEnableAbstractMode to see if this means we can create the wiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1255779 (owner: 10Jforrester) [16:24:39] (03PS3) 10CDanis: Add fundraising-data-uploader role user [puppet] - 10https://gerrit.wikimedia.org/r/1255036 (https://phabricator.wikimedia.org/T416948) [16:24:42] puppet window complete \o/ [16:24:46] Thanks! [16:24:56] (03CR) 10CDanis: Add fundraising-data-uploader role user (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1255036 (https://phabricator.wikimedia.org/T416948) (owner: 10CDanis) [16:24:58] jmm@cumin2002 makevm (PID 4108997) is awaiting input [16:25:25] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [16:25:42] Hmmmmm. [16:25:48] (03Merged) 10jenkins-bot: Revert "[abstractwiki] Temporarily disable wgWikiLambdaEnableAbstractMode to see if this means we can create the wiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1255779 (owner: 10Jforrester) [16:25:50] (03CR) 10CI reject: [V:04-1] airflow-search: add secrets for opensearch-semantic-search clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255778 (https://phabricator.wikimedia.org/T414091) (owner: 10DCausse) [16:26:01] How do I stop the helpful code redirecting me to incubator for abstract.wikipedia.org now that it exists? 
[16:26:06] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1255779 (owner: 10Jforrester) [16:26:21] !log jforrester@deploy2002 Started scap sync-world: Backport for [[gerrit:1255779|Revert "[abstractwiki] Temporarily disable wgWikiLambdaEnableAbstractMode to see if this means we can create the wiki"]] [16:26:22] (03PS8) 10Eevans: services: add linked-artifacts service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250651 (https://phabricator.wikimedia.org/T414112) [16:27:07] FIRING: [2x] ProbeDown: Service aqs2001-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:27:22] James_F: trying to help you out [16:27:41] mutante: Is it just a Varnish cache issue? [16:28:14] !log jforrester@deploy2002 jforrester: Backport for [[gerrit:1255779|Revert "[abstractwiki] Temporarily disable wgWikiLambdaEnableAbstractMode to see if this means we can create the wiki"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [16:28:16] well, I tried to exclude that option and ran a purge cache command [16:28:22] but maybe it isnt that [16:28:56] The MW-side rewrite code is in multiversion/missing.php [16:29:12] it's not a cached redirect, I get that when curling mw-web directly [16:29:16] yeah [16:29:21] < HTTP/2 302 [16:29:21] < date: Thu, 19 Mar 2026 16:28:56 GMT [16:29:21] < server: mw-web.codfw.main-5947f4dd7b-h86d8 [16:29:21] < cache-control: no-cache [16:29:22] < location: https://incubator.wikimedia.org/wiki/Wp/abstract?goto=mainpage [16:29:35] ^ MW is doing the redirect, it was a miss/pass situation in the cache [16:29:46] !log jforrester@deploy2002 jforrester: Continuing with sync [16:30:57] Oh, duh, it's still pre-installed. 
[16:31:02] James_F: abstractwiki is in preinstall.dblist, and multiversion/MWMultiVersion.php line 741 means everything in it gets redirected to incubator [16:31:05] .. yes [16:31:05] So it's behaving correctly. [16:31:07] Yeah. [16:31:12] James_F: how did you get around https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1255514 [16:31:22] I mean https://phabricator.wikimedia.org/T420531 [16:31:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [16:32:03] mutante: Just marked that as Resolved. Amir1 gave the idea of disabling the new extension first, which worked. [16:32:07] RESOLVED: [4x] ProbeDown: Service aqs2001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:32:12] James_F: cool! ack [16:32:41] James_F: thank you very much for fixing that! [16:32:51] Time to activate the wiki. 
[16:33:11] 🎉 [16:33:32] (03PS1) 10Jforrester: Activate Abstract Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1255039 [16:33:40] !log jforrester@deploy2002 Finished scap sync-world: Backport for [[gerrit:1255779|Revert "[abstractwiki] Temporarily disable wgWikiLambdaEnableAbstractMode to see if this means we can create the wiki"]] (duration: 07m 19s) [16:33:41] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:33:43] (03PS2) 10Jforrester: Activate Abstract Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1255039 [16:33:54] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/WP25EasterEggs] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1255765 (https://phabricator.wikimedia.org/T419717) (owner: 10Jdlrobson) [16:34:21] jmm@cumin2002 restart-reboot (PID 4108615) is awaiting input [16:34:24] (03PS3) 10Jforrester: Activate Abstract Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1255039 (https://phabricator.wikimedia.org/T411723) [16:34:39] that commit message ("Activate Abstract Wikipedia") sounds very fancy [16:34:50] Doesn't it just? 
[16:34:51] (03PS4) 10CDanis: Add fundraising-data-uploader role user [puppet] - 10https://gerrit.wikimedia.org/r/1255036 (https://phabricator.wikimedia.org/T416948) [16:34:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1255039 (https://phabricator.wikimedia.org/T411723) (owner: 10Jforrester) [16:35:14] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1255036 (https://phabricator.wikimedia.org/T416948) (owner: 10CDanis) [16:35:30] (03CR) 10Andrew Bogott: [C:03+1] "This will probably help!" [puppet] - 10https://gerrit.wikimedia.org/r/1254877 (https://phabricator.wikimedia.org/T418444) (owner: 10Filippo Giunchedi) [16:35:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.netbox.restart-reboot (exit_code=0) rolling reboot on A:netbox [16:35:47] (03PS3) 10Dzahn: ci::jenkins: add firewall rule to allow legacy machines to new jenkins [puppet] - 10https://gerrit.wikimedia.org/r/1255144 (https://phabricator.wikimedia.org/T418521) [16:35:51] (03Merged) 10jenkins-bot: Activate Abstract Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1255039 (https://phabricator.wikimedia.org/T411723) (owner: 10Jforrester) [16:36:02] 🎉 [16:36:09] !log jforrester@deploy2002 Started scap sync-world: Backport for [[gerrit:1255039|Activate Abstract Wikipedia (T411723)]] [16:36:13] T411723: Set up abstract.wikipedia.org as a new wiki - https://phabricator.wikimedia.org/T411723 [16:38:05] !log jforrester@deploy2002 jforrester: Backport for [[gerrit:1255039|Activate Abstract Wikipedia (T411723)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[16:38:22] !log jforrester@deploy2002 jforrester: Continuing with sync [16:38:41] wikidata says "this is not a wiki" - guess we can update that in a minute [16:39:01] !log jmm@cumin2002 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) [16:39:04] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [16:39:07] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1151.eqiad.wmnet [16:40:01] FIRING: [3x] JobUnavailable: Reduced availability for job netbox_global in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:40:51] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5025.eqsin.wmnet with OS trixie [16:41:56] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-master1004.eqiad.wmnet [16:41:56] !log brett@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on cp5025.eqsin.wmnet with reason: firmware updates [16:41:58] James_F: congratulations. https://lists.wikimedia.org/hyperkitty/list/newprojects@lists.wikimedia.org/thread/62EQX4JXMVNTFY6ROXNF2RH2YWEYN3Q3/ [16:42:06] Thanks! 
[16:42:10] FIRING: [8x] ProbeDown: Service aqs2001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:42:18] !log jforrester@deploy2002 Finished scap sync-world: Backport for [[gerrit:1255039|Activate Abstract Wikipedia (T411723)]] (duration: 06m 09s) [16:42:22] FIRING: [8x] ProbeDown: Service aqs2001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:42:23] T411723: Set up abstract.wikipedia.org as a new wiki - https://phabricator.wikimedia.org/T411723 [16:42:48] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy4004.wikimedia.org - jmm@cumin2002" [16:43:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy4004.wikimedia.org - jmm@cumin2002" [16:43:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:43:08] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache hcaptcha-proxy4004.wikimedia.org on all recursors [16:43:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) hcaptcha-proxy4004.wikimedia.org on all recursors [16:43:41] RESOLVED: [3x] JobUnavailable: Reduced availability for job netbox_global in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:43:41] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by 
cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy4004.wikimedia.org - jmm@cumin2002" [16:43:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy4004.wikimedia.org - jmm@cumin2002" [16:43:47] hmm, where's the post-creation work task for abstractwiki? [16:44:01] !log fnegri@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1013.eqiad.wmnet,service=s3 [16:44:03] 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11728797 (10KFrancis) Hi all, I have sent the NDA out for signatures. I'll confirm when it's complete. Thanks! [16:44:07] (03PS5) 10CDanis: Add fundraising-data-uploader role user [puppet] - 10https://gerrit.wikimedia.org/r/1255036 (https://phabricator.wikimedia.org/T416948) [16:44:11] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1255036 (https://phabricator.wikimedia.org/T416948) (owner: 10CDanis) [16:44:31] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy4004.wikimedia.org with OS bookworm [16:44:40] (03CR) 10CI reject: [V:04-1] Add fundraising-data-uploader role user [puppet] - 10https://gerrit.wikimedia.org/r/1255036 (https://phabricator.wikimedia.org/T416948) (owner: 10CDanis) [16:45:08] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5026.eqsin.wmnet with OS trixie [16:45:25] FIRING: [4x] SystemdUnitFailed: netbox_ganeti_codfw_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:45:50] (03PS1) 10Muehlenhoff: Make hcaptcha-proxy4003/hcaptcha-proxy4004 new hcaptcha-proxy nodes [puppet] - 
10https://gerrit.wikimedia.org/r/1255786 (https://phabricator.wikimedia.org/T418993) [16:46:12] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1151.eqiad.wmnet [16:46:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [16:46:55] (03CR) 10Ssingh: [C:03+1] Make hcaptcha-proxy4003/hcaptcha-proxy4004 new hcaptcha-proxy nodes [puppet] - 10https://gerrit.wikimedia.org/r/1255786 (https://phabricator.wikimedia.org/T418993) (owner: 10Muehlenhoff) [16:47:10] (03PS2) 10DCausse: airflow-search: add secrets for opensearch-semantic-search clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255778 (https://phabricator.wikimedia.org/T414091) [16:47:10] RESOLVED: [8x] ProbeDown: Service aqs2001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:47:20] (03PS6) 10CDanis: Add fundraising-data-uploader role user [puppet] - 10https://gerrit.wikimedia.org/r/1255036 (https://phabricator.wikimedia.org/T416948) [16:47:24] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1255036 (https://phabricator.wikimedia.org/T416948) (owner: 10CDanis) [16:48:13] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1255144/8309/" [puppet] - 10https://gerrit.wikimedia.org/r/1255144 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [16:48:30] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-master1004.eqiad.wmnet [16:48:41] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 71%, RTA = 4996.32 ms [16:50:05] !log jmm@cumin2002 START - 
Cookbook sre.hosts.decommission for hosts doh4001.wikimedia.org [16:51:08] (03PS1) 10Brouberol: kafka-mirrormaker: ensure the right prometheus annotations are set on the pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255792 (https://phabricator.wikimedia.org/T417407) [16:51:56] (03PS1) 10Jcrespo: mariadb: Deploy grants for backup1 sections for new mediabackup workers [puppet] - 10https://gerrit.wikimedia.org/r/1255793 (https://phabricator.wikimedia.org/T420464) [16:52:02] !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp5025.eqsin.wmnet [16:52:10] FIRING: [8x] ProbeDown: Service aqs2002-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:53:12] (03PS1) 10Jforrester: [abstractwiki] Allow "Abstract:" as well as "Abstract Wikipedia:" as NS_PROJECT [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1255794 [16:53:41] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frdb1008 - https://phabricator.wikimedia.org/T414374#11728833 (10VRiley-WMF) [16:53:41] (03CR) 10Brouberol: [C:03+2] kafka-mirrormaker: ensure the right prometheus annotations are set on the pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255792 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol) [16:53:45] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 431.07 ms [16:54:47] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [16:55:10] FIRING: [4x] BFDdown: BFD session down between cr3-ulsfo and 198.35.26.14 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [16:55:25] FIRING: [4x] SystemdUnitFailed: netbox_ganeti_codfw_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:56:43] !log brouberol@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/kafka-mirrormaker: apply [16:57:01] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/1255036 (https://phabricator.wikimedia.org/T416948) (owner: 10CDanis) [16:57:09] !log brouberol@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/kafka-mirrormaker: apply [16:57:10] RESOLVED: [8x] ProbeDown: Service aqs2002-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:57:55] !log brouberol@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/kafka-mirrormaker: apply [16:58:19] !log brouberol@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/kafka-mirrormaker: apply [16:59:33] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh4001.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [16:59:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh4001.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [16:59:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:59:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh4001.wikimedia.org [16:59:57] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts doh4002.wikimedia.org [17:00:05] bd808: Time to snap out of that daydream and deploy Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260319T1700). [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260319T1700) [17:00:07] 06SRE, 06collaboration-services, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11728845 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `doh4001.wikimedia.org` - doh4001... [17:00:10] o/ [17:00:22] I'll be doing a bit of testing in mw-debug during this infra window [17:00:43] 👀 [17:01:00] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5025.eqsin.wmnet [17:03:45] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp5025.* [17:03:55] (03PS1) 10Dzahn: jenkins: pass srange as an array to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1255797 (https://phabricator.wikimedia.org/T418521) [17:04:39] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [17:05:06] I have a developer-portal build to ship in my window today. 
[17:05:08] (03PS1) 10Brouberol: kafka-mirrormaker: add the mirror_name pod label [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255799 (https://phabricator.wikimedia.org/T417407) [17:05:10] FIRING: [8x] BFDdown: BFD session down between cr3-ulsfo and 198.35.26.14 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:05:13] !log brett@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on cp5026.eqsin.wmnet with reason: firmware updates [17:05:27] (03CR) 10Dzahn: [C:03+2] jenkins: pass srange as an array to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1255797 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [17:05:40] (03PS1) 10BryanDavis: developer-portal: Bump to 2026-03-19-122408-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255800 [17:06:05] (03CR) 10Jcrespo: [C:03+2] mariadb: Deploy grants for backup1 sections for new mediabackup workers [puppet] - 10https://gerrit.wikimedia.org/r/1255793 (https://phabricator.wikimedia.org/T420464) (owner: 10Jcrespo) [17:07:10] FIRING: [8x] ProbeDown: Service aqs2003-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:07:22] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [17:07:22] FIRING: [8x] ProbeDown: Service aqs2003-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:07:31] (03PS1) 10Ladsgroup: Make the handler follow the thumb steps [extensions/3D] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1255801 (https://phabricator.wikimedia.org/T414805) [17:07:34] !log swfrench@deploy2002 helmfile [codfw] DONE 
helmfile.d/services/mw-debug: apply [17:08:24] (03PS26) 10Ryan Kemper: dse-k8s: Auto-set OpenSearch pod readahead values [puppet] - 10https://gerrit.wikimedia.org/r/1254320 (https://phabricator.wikimedia.org/T419041) (owner: 10Bking) [17:08:32] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh4002.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [17:08:46] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha-proxy4004.wikimedia.org with reason: host reimage [17:08:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh4002.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [17:08:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:08:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh4002.wikimedia.org [17:08:58] 06SRE, 06collaboration-services, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11728907 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `doh4002.wikimedia.org` - doh4002... [17:10:26] 10ops-codfw, 06DC-Ops: Unresponsive management for backup2005.mgmt:22 - https://phabricator.wikimedia.org/T420613 (10phaultfinder) 03NEW [17:10:49] James_F: I think abstractwiki is missing the usual post-creation work task? [17:11:04] If you show me a template I'll make such a task. [17:11:31] usually the bot makes those, but T404567 [17:11:31] T404567: Post-creation work for tokwiki - https://phabricator.wikimedia.org/T404567 [17:11:39] There's a bot? 
[17:11:47] (03PS1) 10Jcrespo: mediabackups: Pool new worker hosts ms-backup1003 & ms-backup1004 [puppet] - 10https://gerrit.wikimedia.org/r/1255804 (https://phabricator.wikimedia.org/T420464) [17:11:49] (03PS1) 10Jcrespo: mediabackups: Pool new worker hosts ms-backup2003 & ms-backup2004 [puppet] - 10https://gerrit.wikimedia.org/r/1255805 (https://phabricator.wikimedia.org/T420464) [17:12:03] yes, one of the things https://phabricator.wikimedia.org/p/Maintenance_bot/ does [17:12:10] RESOLVED: [8x] ProbeDown: Service aqs2003-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:12:12] (03PS2) 10Jcrespo: mediabackups: Pool new worker hosts ms-backup1003 & ms-backup1004 [puppet] - 10https://gerrit.wikimedia.org/r/1255804 (https://phabricator.wikimedia.org/T420464) [17:12:15] Fancy. [17:12:37] (03PS3) 10Jcrespo: mediabackups: Pool new worker hosts ms-backup1003 & ms-backup1004 [puppet] - 10https://gerrit.wikimedia.org/r/1255804 (https://phabricator.wikimedia.org/T420464) [17:12:38] Surely we're not still doing RESTbase crap? [17:12:42] Oy veh. 
[17:12:44] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1255804 (https://phabricator.wikimedia.org/T420464) (owner: 10Jcrespo) [17:12:46] (03CR) 10CI reject: [V:04-1] mediabackups: Pool new worker hosts ms-backup2003 & ms-backup2004 [puppet] - 10https://gerrit.wikimedia.org/r/1255805 (https://phabricator.wikimedia.org/T420464) (owner: 10Jcrespo) [17:13:33] (03PS2) 10Jcrespo: mediabackup: Pool new worker hosts ms-backup2003 & ms-backup2004 [puppet] - 10https://gerrit.wikimedia.org/r/1255805 (https://phabricator.wikimedia.org/T420464) [17:13:37] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1255805 (https://phabricator.wikimedia.org/T420464) (owner: 10Jcrespo) [17:14:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha-proxy4004.wikimedia.org with reason: host reimage [17:14:05] (03CR) 10BryanDavis: [C:03+2] developer-portal: Bump to 2026-03-19-122408-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255800 (owner: 10BryanDavis) [17:15:29] (03PS1) 10Dzahn: Revert^2 "jenkins: define contint1003 as the manager_host for the jenkins role" [puppet] - 10https://gerrit.wikimedia.org/r/1255808 [17:15:38] !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp5026.eqsin.wmnet [17:15:45] (03PS3) 10Jcrespo: mediabackup: Pool new worker hosts ms-backup2003 & ms-backup2004 [puppet] - 10https://gerrit.wikimedia.org/r/1255805 (https://phabricator.wikimedia.org/T420464) [17:15:48] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1255805 (https://phabricator.wikimedia.org/T420464) (owner: 10Jcrespo) [17:16:09] (03Merged) 10jenkins-bot: developer-portal: Bump to 2026-03-19-122408-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255800 (owner: 10BryanDavis) [17:16:54] (03CR) 10Dzahn: [C:03+2] Revert^2 "jenkins: define contint1003 as the manager_host for the jenkins role" 
[puppet] - 10https://gerrit.wikimedia.org/r/1255808 (owner: 10Dzahn) [17:17:55] FIRING: [8x] ProbeDown: Service aqs2004-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:18:42] !log bd808@deploy2002 helmfile [staging] START helmfile.d/services/developer-portal: apply [17:19:12] (03CR) 10Jcrespo: [C:03+2] mediabackups: Pool new worker hosts ms-backup1003 & ms-backup1004 [puppet] - 10https://gerrit.wikimedia.org/r/1255804 (https://phabricator.wikimedia.org/T420464) (owner: 10Jcrespo) [17:19:37] jouncebot: nowandnext [17:19:37] For the next 0 hour(s) and 40 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260319T1700) [17:19:37] For the next 0 hour(s) and 40 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260319T1700) [17:19:38] In 0 hour(s) and 40 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260319T1800) [17:20:21] (03CR) 10Ryan Kemper: [C:03+1] "Okay, I cleaned up the commit message; I can't identify any further issues with this patch, and in any case we should get this merged and " [puppet] - 10https://gerrit.wikimedia.org/r/1254320 (https://phabricator.wikimedia.org/T419041) (owner: 10Bking) [17:21:54] !log bd808@deploy2002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [17:22:01] !log bd808@deploy2002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [17:22:20] !log bd808@deploy2002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [17:22:35] !log bd808@deploy2002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [17:22:55] RESOLVED: [8x] ProbeDown: Service 
aqs2004-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:22:57] !log bd808@deploy2002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [17:24:30] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5026.eqsin.wmnet [17:24:38] (03CR) 10Jcrespo: [C:03+2] mediabackup: Pool new worker hosts ms-backup2003 & ms-backup2004 [puppet] - 10https://gerrit.wikimedia.org/r/1255805 (https://phabricator.wikimedia.org/T420464) (owner: 10Jcrespo) [17:26:42] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on contint1003.wikimedia.org with reason: jenkins on java21 [17:28:10] FIRING: [8x] ProbeDown: Service aqs2005-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:28:10] FIRING: [8x] ProbeDown: Service aqs2005-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:28:22] (03CR) 10AOkoth: [C:03+2] "Yes, os-reports is listed a few lines above." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1255514 (https://phabricator.wikimedia.org/T414405) (owner: 10AOkoth) [17:29:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha-proxy4004.wikimedia.org with OS bookworm [17:29:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host hcaptcha-proxy4004.wikimedia.org [17:30:36] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp5026.* [17:32:45] (03PS3) 10Jcrespo: mediabackup: Deploy new 24 shards (6 hosts) for mediabackups@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1254999 (https://phabricator.wikimedia.org/T410028) [17:33:10] RESOLVED: [8x] ProbeDown: Service aqs2005-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:33:10] RESOLVED: [8x] ProbeDown: Service aqs2005-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:34:42] 06SRE, 10SRE-swift-storage, 10Observability-Metrics: thanos swift capacity for FY 26/27 - https://phabricator.wikimedia.org/T419713#11729033 (10herron) We chatted about this a bit at the o11y team meeting this week and consensus was that we're looking ok capacity wise, but would like to explore the potential... 
[17:36:06] (03PS4) 10Jcrespo: mediabackup: Deploy new 24 shards (6 hosts) for mediabackups@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1254999 (https://phabricator.wikimedia.org/T410028) [17:38:10] FIRING: [9x] ProbeDown: Service aqs2005-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:38:10] FIRING: [9x] ProbeDown: Service aqs2005-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:39:11] (03PS5) 10Jcrespo: mediabackup: Deploy new 24 shards (6 hosts) for mediabackups@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1254999 (https://phabricator.wikimedia.org/T410028) [17:39:43] (03PS1) 10AOkoth: ats: add wmf-navigator entry [puppet] - 10https://gerrit.wikimedia.org/r/1255818 (https://phabricator.wikimedia.org/T414405) [17:43:10] RESOLVED: [8x] ProbeDown: Service aqs2006-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:43:10] RESOLVED: [8x] ProbeDown: Service aqs2006-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:43:41] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#11729059 (10Krinkle) For posterity, from [Grafana: Swift dashboard (Krinkle copy)](https://grafana-rw.wikimedia.org/d/75a174f3-44b6-4416-a8b8-201ad5a0c09f/swift-krinkle-copy): {F7315... 
[17:44:44] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reboot-single for host lvs1020.eqiad.wmnet [17:44:46] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1254999 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo) [17:45:01] !log cdobbins@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host lvs1020.eqiad.wmnet [17:46:15] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [17:46:25] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [17:47:35] PROBLEM - Host logstash2033 is DOWN: PING CRITICAL - Packet loss = 100% [17:49:35] RECOVERY - Host logstash2033 is UP: PING OK - Packet loss = 0%, RTA = 30.49 ms [17:51:34] (03PS1) 10Jforrester: SpecialAbstractContent: Fix hard-coded policy list page namespace [extensions/WikiLambda] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1255820 [17:53:10] FIRING: [8x] ProbeDown: Service aqs2007-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:53:10] FIRING: [8x] ProbeDown: Service aqs2007-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:53:49] PROBLEM - Host logstash2034 is DOWN: PING CRITICAL - Packet loss = 100% [17:54:35] RECOVERY - Host logstash2034 is UP: PING OK - Packet loss = 0%, RTA = 30.60 ms [17:55:53] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reboot-single for host lvs1020.eqiad.wmnet [17:55:58] FYI, I'm done with my testing for this window [17:56:29] Is there any way to stop MWMultiVersion "cleverly" merging extension-default config instead of over-writing it? 
[17:58:10] RESOLVED: [8x] ProbeDown: Service aqs2007-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:58:10] RESOLVED: [8x] ProbeDown: Service aqs2007-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:58:45] PROBLEM - Host logstash2036 is DOWN: PING CRITICAL - Packet loss = 100% [17:59:15] RECOVERY - Host logstash2036 is UP: PING OK - Packet loss = 0%, RTA = 30.27 ms [18:00:05] andre and brennen: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260319T1800). [18:00:12] nah [18:01:55] RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [18:02:36] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1020.eqiad.wmnet [18:03:41] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [18:03:55] FIRING: [8x] ProbeDown: Service aqs2008-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:04:45] 10SRE-SLO, 13Patch-For-Review: Sloth: onboard existing SLOs to sloth manifests - https://phabricator.wikimedia.org/T418163#11729162 (10herron) [18:04:49] 
10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netbox: netbox report error for puppetdb serial versus netbox serial - https://phabricator.wikimedia.org/T420623 (10RobH) 03NEW [18:06:05] PROBLEM - Host logstash2037 is DOWN: PING CRITICAL - Packet loss = 100% [18:06:23] 10SRE-SLO, 13Patch-For-Review: Sloth: onboard existing SLOs to sloth manifests - https://phabricator.wikimedia.org/T418163#11729180 (10herron) [18:06:35] RECOVERY - Host logstash2037 is UP: PING OK - Packet loss = 0%, RTA = 30.50 ms [18:08:55] RESOLVED: [8x] ProbeDown: Service aqs2008-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:10:16] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netbox: netbox report error for puppetdb serial versus netbox serial for backup1012 - https://phabricator.wikimedia.org/T420623#11729203 (10RobH) [18:12:32] (03PS6) 10Jcrespo: mediabackup: Deploy new 24 shards (6 hosts) for mediabackups@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1254999 (https://phabricator.wikimedia.org/T410028) [18:12:36] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1254999 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo) [18:12:47] PROBLEM - Host logstash2035 is DOWN: PING CRITICAL - Packet loss = 100% [18:12:50] 10SRE-SLO, 13Patch-For-Review: Sloth: onboard existing SLOs to sloth manifests - https://phabricator.wikimedia.org/T418163#11729209 (10herron) p:05Triage→03Medium [18:13:46] 10SRE-SLO, 13Patch-For-Review: Sloth: onboard existing SLOs to sloth manifests - https://phabricator.wikimedia.org/T418163#11729213 (10herron) 05Open→03Resolved a:03herron The sloth onboarding backlog is empty! 
[18:14:15] RECOVERY - Host logstash2035 is UP: PING OK - Packet loss = 0%, RTA = 30.56 ms [18:14:37] FIRING: [4x] ProbeDown: Service aqs2009-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:15:43] (03CR) 10Jcrespo: [C:03+2] mediabackup: Deploy new 24 shards (6 hosts) for mediabackups@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1254999 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo) [18:16:12] (03PS1) 10Jforrester: RepoBooks::onMediaWikiServices: Skip all low NSes, not just NS0 [extensions/WikiLambda] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1255824 (https://phabricator.wikimedia.org/T420617) [18:16:24] jouncebot: nowandnext [18:16:24] For the next 1 hour(s) and 43 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260319T1800) [18:16:24] In 1 hour(s) and 43 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260319T2000) [18:16:29] OK, will deploy. 
[18:18:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [extensions/WikiLambda] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1255824 (https://phabricator.wikimedia.org/T420617) (owner: 10Jforrester) [18:18:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [extensions/WikiLambda] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1255820 (owner: 10Jforrester) [18:18:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1255794 (owner: 10Jforrester) [18:19:10] RESOLVED: [8x] ProbeDown: Service aqs2009-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:19:31] FIRING: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:19:55] (03Merged) 10jenkins-bot: [abstractwiki] Allow "Abstract:" as well as "Abstract Wikipedia:" as NS_PROJECT [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1255794 (owner: 10Jforrester) [18:22:43] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [18:23:04] (03CR) 10CI reject: [V:04-1] RepoBooks::onMediaWikiServices: Skip all low NSes, not just NS0 [extensions/WikiLambda] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1255824 (https://phabricator.wikimedia.org/T420617) (owner: 10Jforrester) [18:23:13] brett: reboot? 
[18:23:23] PROBLEM - pybal on lvs1020 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [18:23:39] (03Merged) 10jenkins-bot: RepoBooks::onMediaWikiServices: Skip all low NSes, not just NS0 [extensions/WikiLambda] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1255824 (https://phabricator.wikimedia.org/T420617) (owner: 10Jforrester) [18:24:02] sukhe: Yeah, but it shouldn't be lvs1020 - cjd91 did you do lvs1020? [18:24:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [extensions/WikiLambda] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1255820 (owner: 10Jforrester) [18:24:19] (03Merged) 10jenkins-bot: SpecialAbstractContent: Fix hard-coded policy list page namespace [extensions/WikiLambda] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1255820 (owner: 10Jforrester) [18:24:24] PROBLEM - PyBal connections to etcd on lvs1020 is CRITICAL: CRITICAL: 0 connections established with conf1007.eqiad.wmnet:4001 (min=112) https://wikitech.wikimedia.org/wiki/PyBal [18:24:31] RESOLVED: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:24:32] yeah. 
sorry, I'll fix it [18:24:39] !log jforrester@deploy2002 Started scap sync-world: Backport for [[gerrit:1255824|RepoBooks::onMediaWikiServices: Skip all low NSes, not just NS0 (T420617)]], [[gerrit:1255820|SpecialAbstractContent: Fix hard-coded policy list page namespace]], [[gerrit:1255794|[abstractwiki] Allow "Abstract:" as well as "Abstract Wikipedia:" as NS_PROJECT]] [18:24:44] T420617: RecentChanges on Abstract Wikipedia links to users are wrong - https://phabricator.wikimedia.org/T420617 [18:25:24] RECOVERY - pybal on lvs1020 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [18:25:39] (03PS1) 10Jcrespo: mediabackup: Apply final role for eqiad mediabackup new storages [puppet] - 10https://gerrit.wikimedia.org/r/1255828 (https://phabricator.wikimedia.org/T420506) [18:25:42] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:25:55] FIRING: [3x] ProbeDown: Service aqs2010-b:7000 has failed probes (tcp_cassandra_b_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:25:59] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1255828 (https://phabricator.wikimedia.org/T420506) (owner: 10Jcrespo) [18:26:38] !log jforrester@deploy2002 jforrester: Backport for [[gerrit:1255824|RepoBooks::onMediaWikiServices: Skip all low NSes, not just NS0 (T420617)]], [[gerrit:1255820|SpecialAbstractContent: Fix hard-coded policy list page namespace]], [[gerrit:1255794|[abstractwiki] Allow "Abstract:" as well as "Abstract Wikipedia:" as NS_PROJECT]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[18:27:02] !log jforrester@deploy2002 jforrester: Continuing with sync [18:29:10] FIRING: [8x] ProbeDown: Service aqs2010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:29:15] (03PS2) 10Jcrespo: mediabackup: Apply final role for eqiad mediabackup new storages [puppet] - 10https://gerrit.wikimedia.org/r/1255828 (https://phabricator.wikimedia.org/T420506) [18:29:24] RECOVERY - PyBal connections to etcd on lvs1020 is OK: OK: 112 connections established with conf1007.eqiad.wmnet:4001 (min=112) https://wikitech.wikimedia.org/wiki/PyBal [18:29:51] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1255828 (https://phabricator.wikimedia.org/T420506) (owner: 10Jcrespo) [18:30:55] FIRING: [8x] ProbeDown: Service aqs2010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:30:59] !log jforrester@deploy2002 Finished scap sync-world: Backport for [[gerrit:1255824|RepoBooks::onMediaWikiServices: Skip all low NSes, not just NS0 (T420617)]], [[gerrit:1255820|SpecialAbstractContent: Fix hard-coded policy list page namespace]], [[gerrit:1255794|[abstractwiki] Allow "Abstract:" as well as "Abstract Wikipedia:" as NS_PROJECT]] (duration: 06m 20s) [18:31:04] T420617: RecentChanges on Abstract Wikipedia links to users are wrong - https://phabricator.wikimedia.org/T420617 [18:32:16] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs1019 is CRITICAL: CRITICAL: Service pybal.service is not active. 
https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [18:32:26] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [18:32:32] PROBLEM - pybal on lvs1019 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [18:32:43] ^known [18:32:52] PROBLEM - PyBal connections to etcd on lvs1019 is CRITICAL: CRITICAL: 0 connections established with conf1007.eqiad.wmnet:4001 (min=82) https://wikitech.wikimedia.org/wiki/PyBal [18:34:10] RESOLVED: [8x] ProbeDown: Service aqs2010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:34:36] (03CR) 10Jcrespo: [C:03+2] mediabackup: Apply final role for eqiad mediabackup new storages [puppet] - 10https://gerrit.wikimedia.org/r/1255828 (https://phabricator.wikimedia.org/T420506) (owner: 10Jcrespo) [18:35:47] cjd91: Did you run the downtime cookbook? [18:38:31] FIRING: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:39:55] brett: everything ok with the lvs? 
I was gonna make some additions to vlans in eqiad row D, won't affect that but if there is some connectivity problem or incident I'll hold off just in case [18:40:10] FIRING: [8x] ProbeDown: Service aqs2011-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:40:30] also holding further deployments just in case [18:41:36] topranks: There's no incident, no - there's reboots going on [18:41:46] cjd91, can you run the downtime cookbook? [18:41:49] ok cool I'll proceed in that case [18:41:57] brett: just being cautious thanks! [18:42:01] thanks for checking! [18:42:07] !log eevans@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-reboot (exit_code=0) rolling reboot on A:aqs-codfw [18:42:09] I will wait a bit for alerts to clear, I need visibility in case I myself throw errors [18:43:31] RESOLVED: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:44:12] !log cdobbins@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on lvs1019.eqiad.wmnet with reason: planned reboot [18:44:20] nice, thanks! 
[18:44:41] sorry about the delay [18:45:10] RESOLVED: [8x] ProbeDown: Service aqs2011-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:46:56] 10ops-eqiad, 06DC-Ops: Inbound errors on interface ssw1-e1-eqiad:xe-0/0/32 (Transport: lvs1020:enp94s0f0np0 (Equinix, 21996479) {#21989994}) - https://phabricator.wikimedia.org/T420634 (10phaultfinder) 03NEW [18:49:08] !log add vlan sub-interface for analytics1-d-eqiad vlan to leaf switches in eqiad row d T405562 [18:50:00] !log eevans@cumin1003 START - Cookbook sre.cassandra.roll-reboot rolling reboot on A:restbase-codfw [18:50:29] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netbox: netbox report error for puppetdb serial versus netbox serial for backup1012 - https://phabricator.wikimedia.org/T420623#11729465 (10Jclark-ctr) a:03Jclark-ctr [18:50:55] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reboot-single for host lvs1019.eqiad.wmnet [18:53:03] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27), 07Essential-Work: hw troubleshooting: Comm Error: Backplane 0 on an-worker1172.eqiad.wmnet - https://phabricator.wikimedia.org/T420416#11729490 (10Jclark-ctr) 05Open→03Resolved [18:53:04] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 6 hosts with reason: kernel module reload [18:54:14] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1019.eqiad.wmnet [18:54:27] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [18:54:33] PROBLEM - pybal on lvs1019 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal 
[18:55:52] FIRING: [10x] ProbeDown: Service aqs2012-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:55:53] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [18:57:36] RECOVERY - pybal on lvs1019 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [18:57:52] RECOVERY - PyBal connections to etcd on lvs1019 is OK: OK: 82 connections established with conf1007.eqiad.wmnet:4001 (min=82) https://wikitech.wikimedia.org/wiki/PyBal [19:00:16] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add analytic vlan hostnames - cmooney@cumin1003" [19:00:20] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add analytic vlan hostnames - cmooney@cumin1003" [19:00:20] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:00:52] RESOLVED: [10x] ProbeDown: Service aqs2012-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:01:42] 06SRE, 10SRE-Access-Requests: Requesting access to production for Mpostoronca-wmf - https://phabricator.wikimedia.org/T420458#11729624 (10thcipriani) > I need this in order to be able to access live db for query optimization while writing new code Do you mean for performance metrics and maintenance scripts? O... 
[19:01:54] RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [19:02:16] RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs1019 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [19:04:07] !log disable IPv6 router-advertisements on eqiad core routers for analytics1-d-eqiad vlan T405562 [19:04:22] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:06:44] FIRING: [12x] ProbeDown: Service restbase2024-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:09:31] FIRING: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:09:45] 10ops-eqiad, 06DC-Ops: Alert for device ps1-e1-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T420645 (10phaultfinder) 03NEW [19:11:44] RESOLVED: [12x] ProbeDown: Service restbase2024-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:13:05] (03PS1) 10Cathal Mooney: analytics1-d-eqiad vlan: cease sending RAs on CRs and DHCP relay [homer/public] - 10https://gerrit.wikimedia.org/r/1255835 (https://phabricator.wikimedia.org/T405562) [19:13:21] (03PS6) 10Jcrespo: mediabackup: Deploy new 24 shards (6 hosts) for mediabackups@codfw [puppet] 
- 10https://gerrit.wikimedia.org/r/1255000 (https://phabricator.wikimedia.org/T410028) [19:14:31] RESOLVED: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:16:31] FIRING: ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:17:10] FIRING: [12x] ProbeDown: Service restbase2024-b:9042 has failed probes (tcp_cassandra_b_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:19:04] (03PS7) 10Jcrespo: mediabackup: Deploy new 24 shards (6 hosts) for mediabackups@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1255000 (https://phabricator.wikimedia.org/T410028) [19:21:31] FIRING: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:21:44] FIRING: [12x] ProbeDown: Service restbase2025-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:22:10] FIRING: [12x] ProbeDown: Service restbase2025-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:26:44] RESOLVED: [12x] ProbeDown: Service restbase2025-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:28:41] FIRING: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:30:01] FIRING: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:30:02] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1006 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:30:08] hmm yeah [19:30:14] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns4004 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:30:32] ACKNOWLEDGEMENT - check if authdns-update was run after a change was merged to operations/dns.git on dns1006 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git Sukhbir Singh gerrit down https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:30:33] ACKNOWLEDGEMENT - check if authdns-update was run after a change was merged to operations/dns.git on dns4004 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git Sukhbir Singh gerrit down https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:31:44] FIRING: [12x] ProbeDown: Service 
restbase2026-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:32:55] FIRING: [12x] ProbeDown: Service restbase2026-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:33:41] RESOLVED: [3x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:34:02] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:34:02] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 1/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:34:19] (03PS8) 10Jcrespo: mediabackup: Deploy new 24 shards (6 hosts) for mediabackups@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1255000 (https://phabricator.wikimedia.org/T410028) [19:34:22] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1255000 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo) [19:34:45] (03CR) 10Cathal Mooney: [C:03+2] Nokia: Manually configure the MAC address for anycast gateway ints [homer/public] - 10https://gerrit.wikimedia.org/r/1255749 (https://phabricator.wikimedia.org/T371088) (owner: 10Cathal Mooney) [19:35:02] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1006 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:35:14] RECOVERY - check if authdns-update was run after a 
change was merged to operations/dns.git on dns4004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:35:36] !log brett@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs1018.eqiad.wmnet with reason: reboots [19:35:58] (03CR) 10Cathal Mooney: [C:03+2] "@Arzhel I messed up here, meant to self-merge a different patch. Let me know what you think here I'm happy to revise." [homer/public] - 10https://gerrit.wikimedia.org/r/1255749 (https://phabricator.wikimedia.org/T371088) (owner: 10Cathal Mooney) [19:36:02] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:36:02] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:36:31] (03Merged) 10jenkins-bot: Nokia: Manually configure the MAC address for anycast gateway ints [homer/public] - 10https://gerrit.wikimedia.org/r/1255749 (https://phabricator.wikimedia.org/T371088) (owner: 10Cathal Mooney) [19:36:31] RESOLVED: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:36:44] RESOLVED: [12x] ProbeDown: Service restbase2026-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:36:57] !log stopping pybal/puppet on lvs1018 for reboots [19:36:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:08] (03CR) 10Jcrespo: [C:03+1] mediabackups: Open s3 storage port on storage hosts from working hosts 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1255003 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo) [19:37:39] FIRING: CoreBGPDown: Core BGP session down between cr2-esams and cr1-eqiad (185.15.59.144) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=esams&var-device=cr2-esams:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr1-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [19:38:05] (03CR) 10Cathal Mooney: [C:03+2] analytics1-d-eqiad vlan: cease sending RAs on CRs and DHCP relay [homer/public] - 10https://gerrit.wikimedia.org/r/1255835 (https://phabricator.wikimedia.org/T405562) (owner: 10Cathal Mooney) [19:38:29] (03PS1) 10Catrope: testwiki: Add temporary groups for security testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1255847 [19:38:36] (03CR) 10Jcrespo: [C:03+2] mediabackup: Deploy new 24 shards (6 hosts) for mediabackups@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1255000 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo) [19:39:31] FIRING: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:41:07] (03Merged) 10jenkins-bot: analytics1-d-eqiad vlan: cease sending RAs on CRs and DHCP relay [homer/public] - 10https://gerrit.wikimedia.org/r/1255835 (https://phabricator.wikimedia.org/T405562) (owner: 10Cathal Mooney) [19:41:44] FIRING: [13x] ProbeDown: Service restbase2026-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown 
[19:42:50] RESOLVED: CoreBGPDown: Core BGP session down between cr2-esams and cr1-eqiad (185.15.59.144) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=esams&var-device=cr2-esams:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr1-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [19:44:31] RESOLVED: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:44:48] !log disable IPv6 VRRP for et-1/0/5.1023 sub-interfaces on eqiad core routers T405562 [19:44:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:53] T405562: Eqiad C/D refresh: move legacy switch uplinks to Nokias and migrate Vlan GWs - https://phabricator.wikimedia.org/T405562 [19:45:51] (03PS8) 10Jcrespo: mediabackups: Open s3 storage port on storage hosts from working hosts [puppet] - 10https://gerrit.wikimedia.org/r/1255003 (https://phabricator.wikimedia.org/T410028) [19:46:40] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2006 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:46:42] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns6001 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:46:44] RESOLVED: [12x] ProbeDown: Service restbase2027-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:47:26] (03CR) 10CI reject: [V:04-1] mediabackups: Open s3 storage port on storage hosts from working hosts [puppet] - 10https://gerrit.wikimedia.org/r/1255003 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo) [19:48:27] (03CR) 10Jcrespo: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1255003 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo) [19:49:00] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2004 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:51:13] ^ due to gerrit restart.. but should resolve [19:51:23] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 7 hosts with reason: kernel module reload [19:52:36] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [19:53:44] !log cmooney@cumin1003 START - Cookbook sre.dns.wipe-cache 4.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.8.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa. on all recursors [19:53:47] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 4.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.8.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa. 
on all recursors [19:54:01] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:54:29] FIRING: [12x] ProbeDown: Service restbase2028-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:54:31] FIRING: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:55:18] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [19:56:24] !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host lvs1018.eqiad.wmnet [19:56:30] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [19:56:39] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2006 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:56:43] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns6001 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:59:29] FIRING: [12x] ProbeDown: Service restbase2028-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:59:30] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1018.eqiad.wmnet [19:59:31] RESOLVED: [2x] ProbeDown: Service 
gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:59:35] (03CR) 10Jcrespo: [C:03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1255003 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo) [19:59:46] PROBLEM - PyBal backends health check on lvs1018 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260319T2000). [20:00:06] arlolra, katherine_g, hector-arroyo, JSherman, and jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:14] PROBLEM - pybal on lvs1018 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [20:00:40] o/ [20:00:46] RECOVERY - PyBal backends health check on lvs1018 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:00:48] o/ [20:01:08] o/ [20:01:14] RECOVERY - pybal on lvs1018 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [20:01:15] hey, there is some instability on gerrit [20:01:18] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add analytic vlan hostnames - cmooney@cumin1003" [20:01:23] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add analytic vlan hostnames - cmooney@cumin1003" [20:01:23] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:01:41] repos may provide errors at the moment [20:01:44] RESOLVED: [12x] ProbeDown: Service restbase2028-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:02:12] if there is enough time, you may want to wait a bit for deployment [20:02:26] lvs errors are fine to be ignored, icinga is removing downtimes after reboots when it shouldn [20:02:28] shouldn't [20:03:20] no, gerrit errors [20:03:24] (for clarity that's two separate things -- LVS errors are ignorable, gerrit recovery is still in progress) [20:03:31] FIRING: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:03:35] er, yeah, thanks r :) [20:03:47] ok I can hold off [20:04:00] I can wait [20:06:44] FIRING: [9x] ProbeDown: Service restbase2029-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:08:31] RESOLVED: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:09:55] FIRING: [12x] ProbeDown: Service restbase2029-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:10:17] (03PS2) 10Catrope: testwiki: Add temporary groups for security testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1255847 [20:11:41] rzl: will there be a clear go ahead signal when we're good to go? 
[20:11:51] !log cdobbins@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on lvs1016.eqiad.wmnet with reason: reboot [20:12:07] JSherman: it seems things are better now, but waiting for some time to confirm it is ok [20:14:55] RESOLVED: [12x] ProbeDown: Service restbase2029-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:15:25] FIRING: [2x] SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:18:23] I've disabled pybal on lvs1016. it shouldn't take longer than 15-20 minutes [20:19:35] katherine_g: it sounds like we can probably get started then? [20:19:55] FIRING: [11x] ProbeDown: Service restbase2030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:19:58] JSherman: ok getting started then [20:20:16] um, should I start? [20:20:45] arlolra: yeah sorry! [20:20:46] arlolra: not trying to line jump! sorry! 
[20:20:53] :) [20:21:13] if you're confident, you can deploy both config changes at once [20:21:15] or I can [20:21:44] FIRING: [12x] ProbeDown: Service restbase2030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:22:28] arlolra: I can do both at once if that's ok [20:22:34] thanks [20:23:17] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kgraessle@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254865 (https://phabricator.wikimedia.org/T418367) (owner: 10Kgraessle) [20:23:17] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kgraessle@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1253654 (https://phabricator.wikimedia.org/T420273) (owner: 10Arlolra) [20:24:44] (03Merged) 10jenkins-bot: Deploy Extension:PersonalDashboard to English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254865 (https://phabricator.wikimedia.org/T418367) (owner: 10Kgraessle) [20:25:22] (03Merged) 10jenkins-bot: Deploy PRV to 13 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1253654 (https://phabricator.wikimedia.org/T420273) (owner: 10Arlolra) [20:25:40] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reboot-single for host lvs1016.eqiad.wmnet [20:25:44] !log kgraessle@deploy2002 Started scap sync-world: Backport for [[gerrit:1254865|Deploy Extension:PersonalDashboard to English Wikipedia (T418367)]], [[gerrit:1253654|Deploy PRV to 13 wikis (T420273)]] [20:25:52] T418367: Deploy Extension:PersonalDashboard to English Wikipedia - https://phabricator.wikimedia.org/T418367 [20:25:52] T420273: Parsoid Read Views to deploy ~2026-03-19 - https://phabricator.wikimedia.org/T420273 [20:26:44] FIRING: [12x] ProbeDown: Service restbase2030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:27:41] !log kgraessle@deploy2002 kgraessle, arlolra: Backport for [[gerrit:1254865|Deploy Extension:PersonalDashboard to English Wikipedia (T418367)]], [[gerrit:1253654|Deploy PRV to 13 wikis (T420273)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:27:59] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1016.eqiad.wmnet [20:28:34] PROBLEM - pybal on lvs1016 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [20:28:46] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [20:29:09] arlolra: synced to test servers [20:29:18] looking [20:29:48] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:29:55] RESOLVED: [12x] ProbeDown: Service restbase2030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:30:08] katherine_g: lgtm [20:30:34] RECOVERY - pybal on lvs1016 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [20:32:40] !log kgraessle@deploy2002 kgraessle, arlolra: Continuing with sync [20:34:55] FIRING: [12x] ProbeDown: Service restbase2031-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:36:44] !log 
kgraessle@deploy2002 Finished scap sync-world: Backport for [[gerrit:1254865|Deploy Extension:PersonalDashboard to English Wikipedia (T418367)]], [[gerrit:1253654|Deploy PRV to 13 wikis (T420273)]] (duration: 11m 00s) [20:36:51] T418367: Deploy Extension:PersonalDashboard to English Wikipedia - https://phabricator.wikimedia.org/T418367 [20:36:51] T420273: Parsoid Read Views to deploy ~2026-03-19 - https://phabricator.wikimedia.org/T420273 [20:37:18] hector-arroyo: we're done, over to you [20:37:50] katherine_g: thank you [20:37:58] arlolra: np [20:39:55] RESOLVED: [12x] ProbeDown: Service restbase2031-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:40:11] when I try to deploy the change clicking on the spiderpig link I get an access denied error, I think I will need help with this [20:43:41] hector-arroyo: I'll have a look [20:44:07] is it because of gerrit instability? [20:44:25] gerrit is being slow again [20:44:31] FIRING: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:44:40] ^ there it is [20:44:53] thanks [20:45:07] so, I think we're stuck [20:46:44] FIRING: [9x] ProbeDown: Service restbase2032-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:46:56] with ~15 minutes left in the window, I don't think we're getting any more backports out [20:47:57] sorry for the gerrit issues [20:48:33] people are still working on it [20:48:36] jynus: I know everybody is doing their best! 
[20:49:55] FIRING: [12x] ProbeDown: Service restbase2032-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:52:04] my change is to test something in testwiki, it's not a big deal if it is deployed next week [20:54:55] RESOLVED: [12x] ProbeDown: Service restbase2032-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:55:01] FIRING: [2x] JobUnavailable: Reduced availability for job gerrit in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:58:41] RESOLVED: [2x] JobUnavailable: Reduced availability for job gerrit in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:59:31] RESOLVED: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260319T2100) [21:00:55] FIRING: [12x] ProbeDown: Service restbase2033-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:03:37] jouncebot: nowandnext [21:03:37] For the 
next 0 hour(s) and 56 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260319T2100) [21:03:37] In 8 hour(s) and 56 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260320T0600) [21:04:30] hector-arroyo: it could be done now [21:04:37] because the following window is empty [21:04:50] gerrit should be doing better [21:05:25] FIRING: [8x] BFDdown: BFD session down between cr3-ulsfo and 198.35.26.14 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [21:05:55] FIRING: [12x] ProbeDown: Service restbase2033-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:06:15] Hey all is it good to deploy? [21:06:22] mutante: I need the Web Team deployment window for a few things. [21:06:35] I can do other deploys if there are outstanding ones from the backport window. [21:06:44] RESOLVED: [12x] ProbeDown: Service restbase2033-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:06:44] Jdlrobson: Gerrit is down-ish, so no deploys right now I think. [21:07:09] Oh, I defer to mutante then. [21:07:13] well.. Gerrit should be better now. [21:07:18] but that window looked empty [21:07:29] and the deployers before missed their window due to the gerrit issue [21:07:30] gotcha. Ok down-ish was the bit I was missing. Do we know when it might be back by? We have a bad bug impacting editors that would be best not to leave over the weekend. [21:07:48] I don't mind doing extra deployments once we're stable [21:08:19] Jdlrobson: hmm. do it! [21:08:36] mutante: so we're good with Gerrit? 
What needs deploying? [21:09:09] I see katherine_g: arlolra here but none of the other people with changes to deploy. [21:09:27] Jdlrobson: Gerrit should be ok again. 2 people were here but then left by now [21:09:40] Jdlrobson: you can do your own change [21:09:41] ok, I'll start with my user bug if that's okay? [21:09:49] yea [21:10:26] jdlrobson: mine and arlolra's changes were already deployed so we're done [21:10:42] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 11 hosts with reason: kernel module reload [21:11:03] katherine_g: thanks for confirming! [21:11:06] I can ping jason [21:11:14] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on backup2020.codfw.wmnet with reason: kernel module reload [21:15:25] FIRING: [2x] SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:16:10] FIRING: [12x] ProbeDown: Service restbase2034-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:21:10] RESOLVED: [12x] ProbeDown: Service restbase2034-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:21:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.93% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - 
https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:22:24] !log jdlrobson@deploy2002 Started scap sync-world: Backport for [[gerrit:1255881|Skins: Address issue with blurry images for large thumbnails (T375981)]] [21:22:29] T375981: Preferences settings for small image size are not being respected for Parsoid Read Views - https://phabricator.wikimedia.org/T375981 [21:22:49] Hey cstone thanks for the review, do you mind to take a look for the smashpig first then I can do a version update for di and civi [21:24:17] !log jdlrobson@deploy2002 jdlrobson: Backport for [[gerrit:1255881|Skins: Address issue with blurry images for large thumbnails (T375981)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:25:32] !log jdlrobson@deploy2002 jdlrobson: Continuing with sync [21:26:10] FIRING: [12x] ProbeDown: Service restbase2035-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:26:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.93% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:29:27] !log jdlrobson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1255881|Skins: Address issue with blurry images for large thumbnails (T375981)]] (duration: 07m 03s) [21:29:32] T375981: Preferences settings for small image size are not being respected for Parsoid Read Views - https://phabricator.wikimedia.org/T375981 [21:31:10] RESOLVED: [12x] ProbeDown: Service restbase2035-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:33:50] Jdlrobson: Once you're done, I have a deploy. [21:34:06] James_F: sounds good. Just this one. Hopefully won't take long [21:34:10] Sure, no worries. [21:40:12] James_F: when you're done, could you ping me? I have a change I'd like to get (does not require scap, just some helmfile'ing on mw-web) [21:40:16] Of course. [21:41:10] FIRING: [12x] ProbeDown: Service restbase2036-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:46:10] FIRING: [12x] ProbeDown: Service restbase2036-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:46:44] RESOLVED: [12x] ProbeDown: Service restbase2036-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:48:03] !log jdlrobson@deploy2002 Started scap sync-world: Backport for [[gerrit:1255765|Implement addListener fallback for older browsers in matchMedia (T419717)]] [21:48:08] T419717: TypeError: mq.addEventListener is not a function. 
(In 'mq.addEventListener('change',listener)', 'mq.addEventListener' is undefined) - https://phabricator.wikimedia.org/T419717 [21:48:21] PROBLEM - Host logging-hd1001 is DOWN: PING CRITICAL - Packet loss = 100% [21:49:49] RECOVERY - Host logging-hd1001 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [21:49:53] !log jdlrobson@deploy2002 jdlrobson: Backport for [[gerrit:1255765|Implement addListener fallback for older browsers in matchMedia (T419717)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:51:26] !log jdlrobson@deploy2002 jdlrobson: Continuing with sync [21:51:42] James_F: syncing now. All yours when done [21:52:55] <3 [21:54:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.27% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:55:20] !log jdlrobson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1255765|Implement addListener fallback for older browsers in matchMedia (T419717)]] (duration: 07m 17s) [21:55:31] T419717: TypeError: mq.addEventListener is not a function. 
(In 'mq.addEventListener('change',listener)', 'mq.addEventListener' is undefined) - https://phabricator.wikimedia.org/T419717 [21:56:09] !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp5024.eqsin.wmnet [reason: trixie reimaging] [21:56:10] FIRING: [12x] ProbeDown: Service restbase2037-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:56:58] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp5024.eqsin.wmnet with OS trixie [21:57:22] !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp5019.eqsin.wmnet [reason: trixie reimaging] [21:57:36] !log eevans@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-reboot (exit_code=0) rolling reboot on A:restbase-codfw [21:58:02] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp5019.eqsin.wmnet with OS trixie [22:01:10] RESOLVED: [12x] ProbeDown: Service restbase2037-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:01:29] !log jforrester@deploy2002 Started scap sync-world: Backport for [[gerrit:1255886|Set WikiLambdaAbstractNamespaces's merge_strategy to provide_default (T420649)]] [22:01:34] T420649: When publishing an Abstract Wikipedia article, it is stored in the wrong Namespace - https://phabricator.wikimedia.org/T420649 [22:03:20] !log jforrester@deploy2002 jforrester: Backport for [[gerrit:1255886|Set WikiLambdaAbstractNamespaces's merge_strategy to provide_default (T420649)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[22:04:22] !log jforrester@deploy2002 jforrester: Continuing with sync [22:06:56] FIRING: MaxConntrack: Elevated conntrack usage on ganeti3006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [22:07:25] swfrench-wmf: Over to you once this sync completes [22:07:37] James_F: awesome, thank you! [22:08:15] !log jforrester@deploy2002 Finished scap sync-world: Backport for [[gerrit:1255886|Set WikiLambdaAbstractNamespaces's merge_strategy to provide_default (T420649)]] (duration: 06m 46s) [22:08:20] T420649: When publishing an Abstract Wikipedia article, it is stored in the wrong Namespace - https://phabricator.wikimedia.org/T420649 [22:12:53] FYI, I'll be deploying a change to mw-web shortly [22:12:58] I'll follow up here when done [22:16:26] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [22:17:44] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [22:18:19] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [22:19:43] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [22:21:56] RESOLVED: MaxConntrack: Elevated conntrack usage on ganeti3006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [22:22:05] I am done [22:23:31] FIRING: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:23:56] FIRING: MaxConntrack: Elevated conntrack usage on ganeti3006:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [22:24:31] PROBLEM - Host logging-hd1002 is DOWN: PING CRITICAL - Packet loss = 100% [22:27:01] RECOVERY - Host logging-hd1002 is UP: PING OK - Packet loss = 0%, RTA = 0.46 ms [22:28:31] RESOLVED: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:37:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-web releases routed via main (k8s) 1.512s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:42:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-web releases routed via main (k8s) 1.235s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:42:45] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-web releases routed via main (k8s) 1.132s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:48:19] !log zabe@deploy2002 mwscript-k8s job started: foreachwikiindblist wikidataclient extensions/Wikibase/lib/maintenance/populateSitesTable.php --force-protocol https # T420643 [22:48:25] T420643: Add Wikidata support for abstractwiki - https://phabricator.wikimedia.org/T420643 [22:52:45] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-web releases routed via main (k8s) 828.1ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:53:56] FIRING: [2x] MaxConntrack: Elevated conntrack usage on ganeti3006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [23:18:23] !log cdobbins@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp5024.eqsin.wmnet with OS trixie [23:19:27] !log cdobbins@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp5019.eqsin.wmnet with OS trixie [23:19:34] zabe: sorry if that is my fault - was just trying to fill in the post creation tasks since the bot didn't do it [23:28:59] jouncebot: nowandnext [23:28:59] No deployments scheduled for the next 6 hour(s) and 31 minute(s) [23:29:00] In 6 hour(s) and 31 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260320T0600) [23:33:51] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1255801|Make the handler follow the thumb steps (T414805)]] [23:33:56] T414805: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805 
[23:35:44] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1255801|Make the handler follow the thumb steps (T414805)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [23:36:12] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [23:40:06] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1255801|Make the handler follow the thumb steps (T414805)]] (duration: 06m 14s) [23:40:10] T414805: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805 [23:59:58] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp5019.eqsin.wmnet with OS trixie