[00:01:28] <logmsgbot>	 !log herron@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafkamon1003.eqiad.wmnet
[00:02:46] <logmsgbot>	 !log herron@cumin1003 START - Cookbook sre.hosts.reboot-single for host kafkamon2003.codfw.wmnet
[00:06:44] <logmsgbot>	 !log herron@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafkamon2003.codfw.wmnet
[00:13:37] <wikibugs>	 (CR) Dzahn: [C:+2] "works now. noop on both sides and puppet resources are managed on both sides. in the filesystem there are timers but no services on the so" [puppet] - https://gerrit.wikimedia.org/r/1254331 (https://phabricator.wikimedia.org/T420246) (owner: Dzahn)
[00:20:17] <wikibugs>	 (CR) Dzahn: [V:+1 C:+2] "https://puppet-compiler.wmflabs.org/output/1251205/8301/" [puppet] - https://gerrit.wikimedia.org/r/1251205 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn)
[00:23:57] <wikibugs>	 (CR) Dzahn: "This will be the actual switch from old to new jenkins now." [puppet] - https://gerrit.wikimedia.org/r/1254308 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn)
[00:25:29] <wikibugs>	 (CR) Dzahn: [V:+1 C:+2] jenkins: define contint1003 as the manager_host for the jenkins role [puppet] - https://gerrit.wikimedia.org/r/1254295 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn)
[00:45:27] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 19 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1251200 (https://phabricator.wikimedia.org/T419312) (owner: Gerrit Patch Uploader)
[00:50:00] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[00:50:43] <codenamenoreste>	 I have rescheduled T419312's deployment for 1:00 to 2:00 AM where I live (CDT in Texas), 7:00 to 8:00 AM in UTC
[00:50:44] <stashbot>	 T419312: Addition of AbuseFilter blocking for the Portuguese Wikipedia - https://phabricator.wikimedia.org/T419312
[00:51:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.43% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[00:54:26] <wikibugs>	 (PS1) Dzahn: jenkins: allow rsyncing of data for migrating a jenkins server [puppet] - https://gerrit.wikimedia.org/r/1255136 (https://phabricator.wikimedia.org/T418521)
[00:54:56] <wikibugs>	 (CR) CI reject: [V:-1] jenkins: allow rsyncing of data for migrating a jenkins server [puppet] - https://gerrit.wikimedia.org/r/1255136 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn)
[00:55:45] <wikibugs>	 (PS1) Dzahn: jenkins: remove httpd profile from role [puppet] - https://gerrit.wikimedia.org/r/1255139
[00:56:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.36% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:00:37] <wikibugs>	 (PS1) Dzahn: ci::jenkins: add firewall rule to allow legacy machines to new jenkins [puppet] - https://gerrit.wikimedia.org/r/1255144 (https://phabricator.wikimedia.org/T418521)
[01:01:16] <wikibugs>	 (CR) CI reject: [V:-1] ci::jenkins: add firewall rule to allow legacy machines to new jenkins [puppet] - https://gerrit.wikimedia.org/r/1255144 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn)
[01:05:00] <wikibugs>	 (PS1) Dzahn: ci::firewall: stop using IPs instead of host names [puppet] - https://gerrit.wikimedia.org/r/1255153
[01:13:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.04% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:18:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 22.54% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:44:25] <jinxer-wm>	 FIRING: [8x] BFDdown: BFD session down between cr3-ulsfo and 10.128.0.6 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[01:50:03] <jinxer-wm>	 FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster  - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[01:54:36] <wikibugs>	 (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1254320 (https://phabricator.wikimedia.org/T419041) (owner: Bking)
[02:08:41] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:33:41] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:45:03] <jinxer-wm>	 RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster  - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[03:00:39] <jinxer-wm>	 FIRING: TransitBGPDown: Transit BGP session down between cr2-magru and EdgeUno (2800:1e0:1025::10e) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr2-magru:9804&var-bgp_group=Transit6&var-bgp_neighbor=EdgeUno - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[03:05:39] <jinxer-wm>	 FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-magru and EdgeUno (200.25.58.212) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[03:35:39] <jinxer-wm>	 RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr2-magru and EdgeUno (200.25.58.212) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[03:35:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:47:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:50:00] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:53:13] <logmsgbot>	 !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[05:44:25] <jinxer-wm>	 FIRING: [8x] BFDdown: BFD session down between cr3-ulsfo and 10.128.0.6 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260319T0600)
[06:00:05] <jouncebot>	 marostegui, Amir1, and federico3: Time to do the Primary database switchover deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260319T0600).
[07:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260319T0700).
[07:00:05] <jouncebot>	 codenamenoreste: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:07:42] <codenamenoreste>	 I am available to test 1251200 and deploy today, if needed
[07:14:38] <logmsgbot>	 !log aokoth@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply
[07:14:56] <logmsgbot>	 !log aokoth@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply
[07:15:08] <codenamenoreste>	 Amir1 and urbanecm
[07:16:13] <logmsgbot>	 !log aokoth@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply
[07:16:34] <logmsgbot>	 !log aokoth@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply
[07:17:11] <logmsgbot>	 !log aokoth@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply
[07:17:35] <logmsgbot>	 !log aokoth@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply
[07:21:44] <wikibugs>	 (PS1) AOkoth: miscweb: fix helmfile add wmf-navigator to releases [deployment-charts] - https://gerrit.wikimedia.org/r/1255514 (https://phabricator.wikimedia.org/T414405)
[07:23:41] <jinxer-wm>	 FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:26:27] <wikibugs>	 ops-magru: Inbound errors on interface cr1-magru:xe-0/1/1 (Transport: cr2-eqiad:xe-1/0/1:3 (Telxius, CRT-008508) {#70089}) - https://phabricator.wikimedia.org/T413409#11726514 (ayounsi) Open→Resolved Indeed! The errors were happening with the same levels of traffic as we have now, so looks like i...
[07:35:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:36:03] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1048.eqiad.wmnet
[07:36:22] <wikibugs>	 (PS4) Arnaudb: gerrit: adjust mpm_event configuration to allow connection reuse on CDN [puppet] - https://gerrit.wikimedia.org/r/1254940 (https://phabricator.wikimedia.org/T420189)
[07:36:22] <wikibugs>	 (CR) Arnaudb: "The initial idea behind that change was to test our working theory on `MaxRequestWorkers`. I've updated the change to fit what's documente" [puppet] - https://gerrit.wikimedia.org/r/1254940 (https://phabricator.wikimedia.org/T420189) (owner: Arnaudb)
[07:39:09] <logmsgbot>	 jmm@cumin2002 drain-node (PID 3980845) is awaiting input
[07:47:26] <wikibugs>	 (PS1) AOkoth: miscweb: add wmf-navigator aux ingress record [dns] - https://gerrit.wikimedia.org/r/1255523 (https://phabricator.wikimedia.org/T414405)
[07:48:29] <codenamenoreste>	 well, no deployment today -_-
[07:52:15] <wikibugs>	 (PS2) AOkoth: miscweb: add wmf-navigator aux ingress record [dns] - https://gerrit.wikimedia.org/r/1255523 (https://phabricator.wikimedia.org/T414405)
[07:58:22] <codenamenoreste>	 I'm still available
[07:58:58] <andre>	 I can maybe backport after the train deployment to be done very soon
[07:59:29] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11726540 (Martyn.ranyard) As the EM of Annie's cross-functional team at WMDE, I approve this request.  @katiamusiolekwmde has not yet got their phabricator...
[08:00:05] <jouncebot>	 andre and brennen: Your horoscope predicts another MediaWiki train - Utc-0+Utc-7 Version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260319T0800).
[08:00:10] <andre>	 o/
[08:00:20] <wikibugs>	 (PS1) Muehlenhoff: Add new doh/hcaptcha-proxy VMs to site.pp [puppet] - https://gerrit.wikimedia.org/r/1255564
[08:00:52] <wikibugs>	 (CR) CI reject: [V:-1] Add new doh/hcaptcha-proxy VMs to site.pp [puppet] - https://gerrit.wikimedia.org/r/1255564 (owner: Muehlenhoff)
[08:02:11] <wikibugs>	 (PS2) Muehlenhoff: Add new doh/hcaptcha-proxy VMs to site.pp [puppet] - https://gerrit.wikimedia.org/r/1255564 (https://phabricator.wikimedia.org/T418993)
[08:03:19] <wikibugs>	 (PS1) TrainBranchBot: group2 to 1.46.0-wmf.20 [mediawiki-config] - https://gerrit.wikimedia.org/r/1255576 (https://phabricator.wikimedia.org/T413811)
[08:03:24] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Initiated by aklapper@deploy2002" [mediawiki-config] - https://gerrit.wikimedia.org/r/1255576 (https://phabricator.wikimedia.org/T413811) (owner: TrainBranchBot)
[08:04:33] <wikibugs>	 (Merged) jenkins-bot: group2 to 1.46.0-wmf.20 [mediawiki-config] - https://gerrit.wikimedia.org/r/1255576 (https://phabricator.wikimedia.org/T413811) (owner: TrainBranchBot)
[08:05:39] <logmsgbot>	 jmm@cumin2002 drain-node (PID 3980845) is awaiting input
[08:07:59] <moritzm>	 ml-etcd1002,dse-k8s-etcd1003 will go down for a Ganeti reboot
[08:08:05] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1048.eqiad.wmnet
[08:08:19] <wikibugs>	 (PS1) Slyngshede: C:external_clouds_vendors remove GeekyWorld [puppet] - https://gerrit.wikimedia.org/r/1255580
[08:09:50] <icinga-wm>	 PROBLEM - Host dse-k8s-etcd1003 is DOWN: PING CRITICAL - Packet loss = 100%
[08:09:50] <icinga-wm>	 PROBLEM - Host ml-etcd1002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:10:43] <wikibugs>	 (CR) Elukey: [C:+1] proton: Bump image [deployment-charts] - https://gerrit.wikimedia.org/r/1254867 (owner: Muehlenhoff)
[08:10:58] <icinga-wm>	 RECOVERY - Host ml-etcd1002 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms
[08:10:58] <icinga-wm>	 RECOVERY - Host dse-k8s-etcd1003 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms
[08:11:50] <wikibugs>	 (PS1) Clément Goubert: rest-gateway: Add linkrecommendation support [deployment-charts] - https://gerrit.wikimedia.org/r/1255581 (https://phabricator.wikimedia.org/T418148)
[08:12:12] <logmsgbot>	 !log aklapper@deploy2002 rebuilt and synchronized wikiversions files: group2 to 1.46.0-wmf.20  refs T413811
[08:12:16] <stashbot>	 T413811: 1.46.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T413811
[08:13:15] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1048.eqiad.wmnet
[08:13:22] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1048.eqiad.wmnet
[08:13:55] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11726559 (Martyn.ranyard) @KFrancis could you organize the NDA signature for this request ? Thanks
[08:14:33] <moritzm>	 !log installing imagemagick security updates on Bullseye
[08:14:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:05] <wikibugs>	 (PS1) Muehlenhoff: Apply installserver role to install4004 [puppet] - https://gerrit.wikimedia.org/r/1255582 (https://phabricator.wikimedia.org/T418993)
[08:19:24] <wikibugs>	 (PS1) Clément Goubert: rest-gateway: Add api.w.o device-analytics support [deployment-charts] - https://gerrit.wikimedia.org/r/1255590 (https://phabricator.wikimedia.org/T418147)
[08:21:44] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-airflow-sre: apply
[08:21:50] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-airflow-sre: apply
[08:25:29] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-sre: apply
[08:25:49] <wikibugs>	 (CR) Ayounsi: [C:+1] Add new doh/hcaptcha-proxy VMs to site.pp [puppet] - https://gerrit.wikimedia.org/r/1255564 (https://phabricator.wikimedia.org/T418993) (owner: Muehlenhoff)
[08:26:30] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Add new doh/hcaptcha-proxy VMs to site.pp [puppet] - https://gerrit.wikimedia.org/r/1255564 (https://phabricator.wikimedia.org/T418993) (owner: Muehlenhoff)
[08:26:40] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-sre: apply
[08:27:00] <wikibugs>	 (CR) Ayounsi: [C:+1] Apply installserver role to install4004 [puppet] - https://gerrit.wikimedia.org/r/1255582 (https://phabricator.wikimedia.org/T418993) (owner: Muehlenhoff)
[08:29:56] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host doh4003.wikimedia.org
[08:29:58] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[08:31:43] <moritzm>	 !log installing python-apt security updates
[08:31:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:42] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh4003.wikimedia.org - jmm@cumin2002"
[08:34:48] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh4003.wikimedia.org - jmm@cumin2002"
[08:34:48] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:34:49] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache doh4003.wikimedia.org on all recursors
[08:34:52] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh4003.wikimedia.org on all recursors
[08:35:08] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[08:37:08] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to production for Mpostoronca-wmf - https://phabricator.wikimedia.org/T420458#11726601 (MPostoronca-WMF)
[08:37:42] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to production for Mpostoronca-wmf - https://phabricator.wikimedia.org/T420458#11726603 (MPostoronca-WMF) >>! In T420458#11723037, @ayounsi wrote: > @OKryva-WMF do you approve this request ? > @thcipriani do you approve this request ? > @MPostoronca-WMF could yo...
[08:38:42] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM doh4003.wikimedia.org - jmm@cumin2002"
[08:38:47] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM doh4003.wikimedia.org - jmm@cumin2002"
[08:38:47] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:38:48] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache doh4003.wikimedia.org on all recursors
[08:38:51] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh4003.wikimedia.org on all recursors
[08:38:56] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host doh4003.wikimedia.org
[08:39:39] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet
[08:40:05] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti4007.ulsfo.wmnet
[08:41:34] <wikibugs>	 SRE, SRE-swift-storage, Observability-Metrics: thanos swift capacity for FY 26/27 - https://phabricator.wikimedia.org/T419713#11726622 (MatthewVernon) @hnowlan can I push this up your stack, please? Willy wants all procurement requests for next FY done by end of next week (i.e. 27 March).
[08:42:04] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of install4003.wikimedia.org to plain
[08:42:45] <wikibugs>	 SRE, collaboration-services, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11726624 (ops-monitoring-bot) VM install4003.wikimedia.org switching disk type to plain
[08:43:17] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install4003.wikimedia.org to plain
[08:43:21] <wikibugs>	 (CR) Arnaudb: [C:+1] miscweb: add wmf-navigator aux ingress record [dns] - https://gerrit.wikimedia.org/r/1255523 (https://phabricator.wikimedia.org/T414405) (owner: AOkoth)
[08:44:15] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of hcaptcha-proxy4002.wikimedia.org to plain
[08:44:46] <wikibugs>	 SRE, collaboration-services, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11726626 (ops-monitoring-bot) VM hcaptcha-proxy4002.wikimedia.org switching disk type to plain
[08:45:05] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of hcaptcha-proxy4002.wikimedia.org to plain
[08:45:10] <wikibugs>	 (CR) Arnaudb: [C:+1] miscweb: fix helmfile add wmf-navigator to releases [deployment-charts] - https://gerrit.wikimedia.org/r/1255514 (https://phabricator.wikimedia.org/T414405) (owner: AOkoth)
[08:45:40] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of hcaptcha-proxy4001.wikimedia.org to plain
[08:46:15] <wikibugs>	 SRE, collaboration-services, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11726629 (ops-monitoring-bot) VM hcaptcha-proxy4001.wikimedia.org switching disk type to plain
[08:46:34] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of hcaptcha-proxy4001.wikimedia.org to plain
[08:46:49] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of doh4002.wikimedia.org to plain
[08:47:38] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on hcaptcha-proxy4002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[08:47:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:48:22] <wikibugs>	 SRE, collaboration-services, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11726632 (ops-monitoring-bot) VM doh4002.wikimedia.org switching disk type to plain
[08:48:38] <icinga-wm>	 RECOVERY - Bird Internet Routing Daemon on hcaptcha-proxy4002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[08:48:40] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on hcaptcha-proxy4001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[08:48:41] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh4002.wikimedia.org to plain
[08:49:25] <jinxer-wm>	 FIRING: [10x] BFDdown: BFD session down between cr3-ulsfo and 10.128.0.6 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[08:49:35] <wikibugs>	 (PS3) Daniel Kinzler: rest-gateway: update readme [deployment-charts] - https://gerrit.wikimedia.org/r/1254848
[08:49:38] <icinga-wm>	 RECOVERY - Bird Internet Routing Daemon on hcaptcha-proxy4001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[08:50:28] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on doh4002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[08:52:28] <icinga-wm>	 RECOVERY - Bird Internet Routing Daemon on doh4002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[08:54:25] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of doh4001.wikimedia.org to plain
[08:54:25] <jinxer-wm>	 FIRING: [16x] BFDdown: BFD session down between cr3-ulsfo and 10.128.0.6 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[08:54:40] <jinxer-wm>	 FIRING: [16x] BFDdown: BFD session down between cr3-ulsfo and 10.128.0.6 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[08:56:00] <wikibugs>	 (PS1) Muehlenhoff: Remove ganeti4007 from classic Ganeti cluster in ulsfo [puppet] - https://gerrit.wikimedia.org/r/1255603 (https://phabricator.wikimedia.org/T418993)
[08:56:28] <wikibugs>	 SRE, collaboration-services, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11726637 (ops-monitoring-bot) VM doh4001.wikimedia.org switching disk type to plain
[08:56:44] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh4001.wikimedia.org to plain
[08:58:00] <wikibugs>	 (CR) Ayounsi: [C:+1] Remove ganeti4007 from classic Ganeti cluster in ulsfo [puppet] - https://gerrit.wikimedia.org/r/1255603 (https://phabricator.wikimedia.org/T418993) (owner: Muehlenhoff)
[08:58:22] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to WMDE LDAP group for Sarmbruster - https://phabricator.wikimedia.org/T420410#11726639 (Sarmbruster) Just signed the NDA via docusign.
[08:58:26] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on doh4001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[09:00:26] <icinga-wm>	 RECOVERY - Bird Internet Routing Daemon on doh4001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[09:01:50] <moritzm>	 !log remove ganeti4007 from classic Ganeti cluster in ulsfo T418993
[09:01:52] <icinga-wm>	 PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[09:01:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:55] <stashbot>	 T418993: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993
[09:04:25] <jinxer-wm>	 FIRING: [20x] BFDdown: BFD session down between cr3-ulsfo and 10.128.0.6 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[09:04:26] <icinga-wm>	 PROBLEM - ganeti-confd running on ganeti4007 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 109 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti
[09:04:26] <icinga-wm>	 PROBLEM - ganeti-noded running on ganeti4007 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti
[09:05:50] <jinxer-wm>	 FIRING: ProbeDown: Service ganeti4007:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:06:54] <icinga-wm>	 RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 222.32 ms
[09:08:39] <wikibugs>	 (PS1) Effie Mouzeli: hieradata: migrate codfw memcached cluster to nftables [puppet] - https://gerrit.wikimedia.org/r/1255610 (https://phabricator.wikimedia.org/T398611)
[09:09:27] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Remove ganeti4007 from classic Ganeti cluster in ulsfo [puppet] - https://gerrit.wikimedia.org/r/1255603 (https://phabricator.wikimedia.org/T418993) (owner: Muehlenhoff)
[09:10:41] <wikibugs>	 (PS1) Brouberol: Revert^2 "kafka-mirrormaker: migrate logging-{eqiad,codfw}->jumbo-eqiad to aux-eqiad" [deployment-charts] - https://gerrit.wikimedia.org/r/1255612
[09:11:59] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-serve-worker-codfw
[09:13:13] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti4007.ulsfo.wmnet with OS bookworm
[09:13:29] <wikibugs>	 SRE, collaboration-services, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11726657 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti4007.ulsfo.wmnet with...
[09:14:18] <wikibugs>	 (CR) Elukey: [C:+1] Revert^2 "kafka-mirrormaker: migrate logging-{eqiad,codfw}->jumbo-eqiad to aux-eqiad" [deployment-charts] - https://gerrit.wikimedia.org/r/1255612 (owner: Brouberol)
[09:14:25] <jinxer-wm>	 RESOLVED: [12x] BFDdown: BFD session down between cr3-ulsfo and 10.128.0.6 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[09:15:47] <wikibugs>	 (CR) Brouberol: [C:+2] Revert^2 "kafka-mirrormaker: migrate logging-{eqiad,codfw}->jumbo-eqiad to aux-eqiad" [deployment-charts] - https://gerrit.wikimedia.org/r/1255612 (owner: Brouberol)
[09:15:49] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "That is really great, many thanks for pushing this forward!" [puppet] - https://gerrit.wikimedia.org/r/1255610 (https://phabricator.wikimedia.org/T398611) (owner: Effie Mouzeli)
[09:19:17] <logmsgbot>	 !log brouberol@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/kafka-mirrormaker: apply
[09:19:40] <logmsgbot>	 !log brouberol@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/kafka-mirrormaker: apply
[09:21:20] <logmsgbot>	 !log brouberol@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/kafka-mirrormaker: apply
[09:21:43] <logmsgbot>	 !log brouberol@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/kafka-mirrormaker: apply
[09:24:02] <logmsgbot>	 !log klausman@cumin2002 END (ERROR) - Cookbook sre.k8s.reboot-nodes (exit_code=97) rolling reboot on A:ml-serve-worker-codfw
[09:26:12] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-serve-worker-codfw
[09:26:29] <wikibugs>	 (03PS1) 10Brouberol: aux-k8s/kafka-mirrormaker: add missing releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255620 (https://phabricator.wikimedia.org/T417407)
[09:26:54] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-serve-worker-codfw
[09:28:42] <wikibugs>	 06SRE, 06collaboration-services, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11726685 (10ayounsi)
[09:29:03] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.ceph.roll-restart-reboot-server rolling reboot on A:cephosd-eqiad
[09:31:43] <icinga-wm>	 PROBLEM - BFD status on lsw1-e1-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:33:15] <wikibugs>	 (03CR) 10Jaime Nuche: "Make sense, thanks for the improvement @dzahn@wikimedia.org!" [puppet] - 10https://gerrit.wikimedia.org/r/1254331 (https://phabricator.wikimedia.org/T420246) (owner: 10Dzahn)
[09:33:29] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.13 point update - https://phabricator.wikimedia.org/T414205#11726713 (10MoritzMuehlenhoff)
[09:35:13] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti4007.ulsfo.wmnet with reason: host reimage
[09:35:37] <moritzm>	 !log installing libnginx-mod-http-lua security updates
[09:35:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:41] <icinga-wm>	 RECOVERY - BFD status on lsw1-e1-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:39:14] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti4007.ulsfo.wmnet with reason: host reimage
[09:42:15] <logmsgbot>	 !log jnuche@deploy2002 Started deploy [releng/jenkins-deploy@863e5c2] (releasing): T420477
[09:43:02] <logmsgbot>	 !log jnuche@deploy2002 Finished deploy [releng/jenkins-deploy@863e5c2] (releasing): T420477 (duration: 00m 59s)
[09:43:24] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1017.eqiad.wmnet with OS bookworm
[09:45:34] <logmsgbot>	 !log jnuche@deploy2002 Started deploy [releng/jenkins-deploy@863e5c2] (releasing): T420477
[09:46:41] <logmsgbot>	 !log jnuche@deploy2002 Finished deploy [releng/jenkins-deploy@863e5c2] (releasing): T420477 (duration: 01m 07s)
[09:46:44] <icinga-wm>	 PROBLEM - BFD status on lsw1-e2-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:49:39] <wikibugs>	 (03CR) 10Elukey: [C:03+1] aux-k8s/kafka-mirrormaker: add missing releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255620 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol)
[09:53:05] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host schema1003.eqiad.wmnet
[09:53:13] <wikibugs>	 (03PS3) 10Fabfur: haproxy: test haproxy32 on cp2041 [puppet] - 10https://gerrit.wikimedia.org/r/1254195 (https://phabricator.wikimedia.org/T419825)
[09:53:24] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1254195 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur)
[09:53:44] <icinga-wm>	 RECOVERY - BFD status on lsw1-e2-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:56:20] <wikibugs>	 (03CR) 10Gmodena: [C:03+1] wikidata-platform: wdqs-queryhammer helmfile deployment (3 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251453 (https://phabricator.wikimedia.org/T417415) (owner: 10Trueg)
[09:56:34] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] aux-k8s/kafka-mirrormaker: add missing releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255620 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol)
[09:57:00] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host schema1003.eqiad.wmnet
[09:57:10] <wikibugs>	 (03CR) 10Daniel Kinzler: [C:03+2] rest gateway: merge authed-other into authed-bot [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254921 (https://phabricator.wikimedia.org/T420467) (owner: 10Daniel Kinzler)
[09:58:17] <wikibugs>	 (03PS1) 10Muehlenhoff: Make ganeti4007 a Ganeti node on routed Ganeti/ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1255644 (https://phabricator.wikimedia.org/T418993)
[09:58:28] <logmsgbot>	 !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 17 hosts with reason: upgrade
[09:58:37] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+2] hieradata: migrate codfw memcached cluster to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1255610 (https://phabricator.wikimedia.org/T398611) (owner: 10Effie Mouzeli)
[09:59:22] <wikibugs>	 (03Merged) 10jenkins-bot: rest gateway: merge authed-other into authed-bot [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254921 (https://phabricator.wikimedia.org/T420467) (owner: 10Daniel Kinzler)
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260319T1000)
[10:00:31] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti4007.ulsfo.wmnet with OS bookworm
[10:00:41] <wikibugs>	 06SRE, 06collaboration-services, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11726772 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti4007.ulsfo.wmnet with OS b...
[10:02:44] <icinga-wm>	 PROBLEM - BFD status on lsw1-e3-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:03:59] <logmsgbot>	 !log daniel@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[10:04:41] <logmsgbot>	 !log daniel@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[10:08:44] <icinga-wm>	 RECOVERY - BFD status on lsw1-e3-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:09:13] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.opensearch.roll-restart-reboot rolling reboot on A:datahubsearch
[10:09:39] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] Make ganeti4007 a Ganeti node on routed Ganeti/ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1255644 (https://phabricator.wikimedia.org/T418993) (owner: 10Muehlenhoff)
[10:10:06] <wikibugs>	 10ops-codfw, 06SRE, 10Data-Persistence-Backup, 06DC-Ops, 10media-backups: backup2005 power supplies fried or overvoltage - https://phabricator.wikimedia.org/T419970#11726795 (10jcrespo)
[10:10:07] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for backup2005.mgmt:22 - https://phabricator.wikimedia.org/T420308#11726798 (10jcrespo) →14Duplicate dup:03T419970
[10:10:14] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host testvm2007.codfw.wmnet
[10:10:51] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Make ganeti4007 a Ganeti node on routed Ganeti/ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1255644 (https://phabricator.wikimedia.org/T418993) (owner: 10Muehlenhoff)
[10:13:30] <logmsgbot>	 !log fnegri@cumin1003 START - Cookbook sre.hosts.reboot-single for host clouddumps1001.wikimedia.org
[10:14:02] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testvm2007.codfw.wmnet
[10:16:40] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host testvm2008.wikimedia.org
[10:18:24] <logmsgbot>	 !log daniel@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[10:19:41] <logmsgbot>	 !log daniel@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[10:20:16] <wikibugs>	 (03CR) 10Btullis: wikidata-platform: wdqs-queryhammer chart (1 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251095 (https://phabricator.wikimedia.org/T417415) (owner: 10Trueg)
[10:20:29] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testvm2008.wikimedia.org
[10:21:46] <logmsgbot>	 !log fnegri@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host clouddumps1001.wikimedia.org
[10:21:54] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host testvm2006.codfw.wmnet
[10:22:15] <wikibugs>	 (03CR) 10Btullis: [C:03+2] wikidata-platform: wdqs-queryhammer chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251095 (https://phabricator.wikimedia.org/T417415) (owner: 10Trueg)
[10:22:25] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet
[10:23:38] <wikibugs>	 (03CR) 10AOkoth: [C:03+2] miscweb: fix helmfile add wmf-navigator to releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255514 (https://phabricator.wikimedia.org/T414405) (owner: 10AOkoth)
[10:23:55] <wikibugs>	 (03Merged) 10jenkins-bot: wikidata-platform: wdqs-queryhammer chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251095 (https://phabricator.wikimedia.org/T417415) (owner: 10Trueg)
[10:24:43] <logmsgbot>	 !log btullis@cumin1003 END (FAIL) - Cookbook sre.ceph.roll-restart-reboot-server (exit_code=99) rolling reboot on A:cephosd-eqiad
[10:25:00] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.opensearch.roll-restart-reboot (exit_code=0) rolling reboot on A:datahubsearch
[10:25:43] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testvm2006.codfw.wmnet
[10:26:56] <logmsgbot>	 !log daniel@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[10:28:40] <logmsgbot>	 !log fnegri@cumin1003 START - Cookbook sre.hosts.reboot-single for host clouddumps1002.wikimedia.org
[10:28:49] <logmsgbot>	 !log daniel@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[10:29:08] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.ceph.roll-restart-reboot-server rolling reboot on P{cephosd100[4-5]*} and (A:cephosd-codfw or A:cephosd-eqiad)
[10:30:32] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet
[10:31:47] <logmsgbot>	 !log aokoth@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply
[10:32:26] <logmsgbot>	 !log aokoth@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply
[10:32:46] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host schema1004.eqiad.wmnet
[10:33:41] <logmsgbot>	 !log aokoth@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply
[10:33:45] <icinga-wm>	 PROBLEM - BFD status on lsw1-f1-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:34:09] <logmsgbot>	 !log aokoth@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply
[10:36:14] <wikibugs>	 (03PS1) 10Federico Ceratto: wmnet: update CNAME records for DB masters to eqiad [dns] - 10https://gerrit.wikimedia.org/r/1255655 (https://phabricator.wikimedia.org/T416705)
[10:36:20] <logmsgbot>	 !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1017.eqiad.wmnet with OS bookworm
[10:36:44] <Raine>	 !log created temporary categorylinks_icu72 tables -- T419980, T419049
[10:36:45] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host schema1004.eqiad.wmnet
[10:36:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:36:49] <stashbot>	 T419980: ICU 72 upgrade: `categorylinks` table swap - https://phabricator.wikimedia.org/T419980
[10:36:50] <stashbot>	 T419049: Upgrade the MediaWiki servers to ICU 72 ☂️ - https://phabricator.wikimedia.org/T419049
[10:37:04] <logmsgbot>	 !log fnegri@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host clouddumps1002.wikimedia.org
[10:37:07] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host schema2003.codfw.wmnet
[10:37:11] <wikibugs>	 06SRE-OnFire, 10Cite, 10VisualEditor, 10WMDE-TechWish-Maintenance, and 3 others: Investigation: Write visual editor debug tool to produce Converter test cases - https://phabricator.wikimedia.org/T400311#11726876 (10WMDE-Fisch) @awight maybe we close this ticket and abandon leftover patches for now? 🤔
[10:39:35] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host schema2003.codfw.wmnet
[10:40:45] <icinga-wm>	 RECOVERY - BFD status on lsw1-f1-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:41:29] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host schema2004.codfw.wmnet
[10:42:53] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti4007.ulsfo.wmnet to cluster ulsfo02 and group 01
[10:43:32] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti4007.ulsfo.wmnet to cluster ulsfo02 and group 01
[10:43:46] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host doh4003.wikimedia.org
[10:43:48] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[10:44:24] <wikibugs>	 06SRE, 06collaboration-services, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11726885 (10MoritzMuehlenhoff)
[10:45:11] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host testvm2005.codfw.wmnet
[10:45:30] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host schema2004.codfw.wmnet
[10:46:02] <wikibugs>	 (03PS1) 10Brouberol: kafka-main-codfw: disable mirroring to kafka-main-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1255656 (https://phabricator.wikimedia.org/T417407)
[10:46:04] <wikibugs>	 (03PS1) 10Brouberol: kafka-main-eqiad: disable mirroring to kafka-main-codfw [puppet] - 10https://gerrit.wikimedia.org/r/1255657 (https://phabricator.wikimedia.org/T417407)
[10:46:07] <wikibugs>	 (03PS1) 10Brouberol: kafka-jumbo-eqiad: disable mirroring from kafka-main-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1255658 (https://phabricator.wikimedia.org/T417407)
[10:47:54] <wikibugs>	 (03PS2) 10Brouberol: kafka-main-eqiad: disable mirroring to kafka-main-codfw [puppet] - 10https://gerrit.wikimedia.org/r/1255657 (https://phabricator.wikimedia.org/T417407)
[10:47:54] <wikibugs>	 (03PS2) 10Brouberol: kafka-main-codfw: disable mirroring to kafka-main-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1255656 (https://phabricator.wikimedia.org/T417407)
[10:47:54] <wikibugs>	 (03PS2) 10Brouberol: kafka-jumbo-eqiad: disable mirroring from kafka-main-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1255658 (https://phabricator.wikimedia.org/T417407)
[10:48:00] <wikibugs>	 (03PS1) 10Brouberol: aux-k8s/kafka-mirrormaker: add main-eqiad-to-main-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255659 (https://phabricator.wikimedia.org/T417407)
[10:48:03] <wikibugs>	 (03PS1) 10Brouberol: aux-k8s/kafka-mirrormaker: add main-codfw-to-main-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255660 (https://phabricator.wikimedia.org/T417407)
[10:48:05] <wikibugs>	 (03PS1) 10Brouberol: aux-k8s/kafka-mirrormaker: add main-eqad-to-jumbo-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255661 (https://phabricator.wikimedia.org/T417407)
[10:48:10] <wikibugs>	 (03PS1) 10Brouberol: aux-k8s/kafka-mirrormaker: cleanup helmfile of duplicated namespace definitions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255662 (https://phabricator.wikimedia.org/T417407)
[10:48:45] <icinga-wm>	 PROBLEM - BFD status on lsw1-f2-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:48:48] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testvm2005.codfw.wmnet
[10:49:54] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host testvm2004.codfw.wmnet
[10:50:08] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh4003.wikimedia.org - jmm@cumin2002"
[10:50:13] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh4003.wikimedia.org - jmm@cumin2002"
[10:50:13] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:50:14] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache doh4003.wikimedia.org on all recursors
[10:50:18] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh4003.wikimedia.org on all recursors
[10:50:33] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[10:51:08] <logmsgbot>	 !log jiji@cumin1003 START - Cookbook sre.memcached.roll-reboot-restart rolling reboot on A:memcached-codfw
[10:53:42] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testvm2004.codfw.wmnet
[10:54:29] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM doh4003.wikimedia.org - jmm@cumin2002"
[10:54:37] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM doh4003.wikimedia.org - jmm@cumin2002"
[10:54:38] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:54:38] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache doh4003.wikimedia.org on all recursors
[10:54:42] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh4003.wikimedia.org on all recursors
[10:54:46] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host doh4003.wikimedia.org
[10:55:43] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.ceph.roll-restart-reboot-server (exit_code=0) rolling reboot on P{cephosd100[4-5]*} and (A:cephosd-codfw or A:cephosd-eqiad)
[10:55:45] <icinga-wm>	 RECOVERY - BFD status on lsw1-f2-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:58:21] <wikibugs>	 10SRE-tools, 06Infrastructure-Foundations, 06serviceops-radar: Add --min-uptime to cookbooks - https://phabricator.wikimedia.org/T419967#11726917 (10fgiunchedi) FWIW I found some prior art / ideas here {T367592}
[10:59:58] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet
[11:03:47] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet
[11:05:07] <logmsgbot>	 !log klausman@cumin2002 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:ml-serve-worker-codfw
[11:07:39] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet
[11:08:41] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:09:51] <wikibugs>	 (03CR) 10Ladsgroup: "You can use https://switchmaster.toolforge.org/dc-switch to create this and it's much safer since it's automatic." [dns] - 10https://gerrit.wikimedia.org/r/1255655 (https://phabricator.wikimedia.org/T416705) (owner: 10Federico Ceratto)
[11:10:28] <wikibugs>	 10SRE-tools, 06Infrastructure-Foundations, 06serviceops-radar: Add --min-uptime to cookbooks - https://phabricator.wikimedia.org/T419967#11726958 (10jijiki)
[11:11:05] <wikibugs>	 (03CR) 10Elukey: [C:03+1] aux-k8s/kafka-mirrormaker: add main-eqiad-to-main-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255659 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol)
[11:11:39] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet
[11:11:42] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1017.eqiad.wmnet with OS bookworm
[11:12:12] <wikibugs>	 (03CR) 10Elukey: [C:03+1] aux-k8s/kafka-mirrormaker: add main-codfw-to-main-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255660 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol)
[11:12:36] <wikibugs>	 (03CR) 10Elukey: [C:03+1] aux-k8s/kafka-mirrormaker: add main-eqad-to-jumbo-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255661 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol)
[11:13:28] <wikibugs>	 (03CR) 10Elukey: [C:04-1] "Precautionary -1" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255660 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol)
[11:13:41] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:14:14] <wikibugs>	 (03CR) 10Elukey: [C:03+1] aux-k8s/kafka-mirrormaker: cleanup helmfile of duplicated namespace definitions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255662 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol)
[11:15:01] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:17:36] <wikibugs>	 (03PS1) 10JavierMonton: stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255668 (https://phabricator.wikimedia.org/T420448)
[11:18:01] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.ceph.roll-restart-reboot-server rolling reboot on A:cephosd-codfw
[11:18:09] <logmsgbot>	 !log jiji@cumin1003 END (PASS) - Cookbook sre.memcached.roll-reboot-restart (exit_code=0) rolling reboot on A:memcached-codfw
[11:19:28] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: wmnet: update CNAME records for DB masters for dc switchover [dns] - 10https://gerrit.wikimedia.org/r/1255669 (https://phabricator.wikimedia.org/T416705)
[11:20:48] <icinga-wm>	 PROBLEM - BFD status on lsw1-a7-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:21:55] <wikibugs>	 (03CR) 10Muehlenhoff: "(We'll keep that up when Jelto is back)" [puppet] - 10https://gerrit.wikimedia.org/r/1251406 (owner: 10Jelto)
[11:25:01] <jinxer-wm>	 FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[11:25:53] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to +2 on operations/deployment-charts for trueg and lerickson - https://phabricator.wikimedia.org/T420568 (10trueg) 03NEW
[11:26:29] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1017.eqiad.wmnet with reason: host reimage
[11:26:48] <wikibugs>	 (03CR) 10TChin: [C:03+1] stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255668 (https://phabricator.wikimedia.org/T420448) (owner: 10JavierMonton)
[11:27:30] <wikibugs>	 (03PS2) 10Ladsgroup: wmnet: update CNAME records for DB masters for dc switchover [dns] - 10https://gerrit.wikimedia.org/r/1255669 (https://phabricator.wikimedia.org/T416705) (owner: 10Gerrit maintenance bot)
[11:28:41] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:28:41] <jinxer-wm>	 FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[11:28:48] <icinga-wm>	 RECOVERY - BFD status on lsw1-a7-codfw.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:30:18] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1017.eqiad.wmnet with reason: host reimage
[11:32:25] <wikibugs>	 (03CR) 10Slyngshede: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1254195 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur)
[11:33:07] <wikibugs>	 (03CR) 10JavierMonton: [C:03+2] stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255668 (https://phabricator.wikimedia.org/T420448) (owner: 10JavierMonton)
[11:34:37] <wikibugs>	 06SRE, 10MinT, 10Prod-Kubernetes, 06ServiceOps new, and 3 others: Can't deploy machinetranslation due to exceeding resource quotas - https://phabricator.wikimedia.org/T411058#11727019 (10Nikerabbit) 05In progress→03Resolved
[11:35:14] <wikibugs>	 (03Merged) 10jenkins-bot: stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255668 (https://phabricator.wikimedia.org/T420448) (owner: 10JavierMonton)
[11:35:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:36:48] <icinga-wm>	 PROBLEM - BFD status on lsw1-c2-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:40:13] <wikibugs>	 (03PS4) 10Trueg: wikidata-platform: wdqs-queryhammer helmfile deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251453 (https://phabricator.wikimedia.org/T417415)
[11:40:35] <logmsgbot>	 !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[11:40:44] <logmsgbot>	 !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[11:41:26] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] cassandra-http-gateway: new chart based on aqs-http-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250649 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans)
[11:41:48] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] charts/cassandra-http-gateway: template table configuration for hoarde [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250650 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans)
[11:42:48] <icinga-wm>	 RECOVERY - BFD status on lsw1-c2-codfw.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:43:54] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Upgrade Routinator to 0.15.1 - https://phabricator.wikimedia.org/T420572 (10MoritzMuehlenhoff) 03NEW
[11:44:35] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Upgrade Routinator to 0.15.1 - https://phabricator.wikimedia.org/T420572#11727103 (10MoritzMuehlenhoff) p:05Triage→03Medium
[11:44:48] <wikibugs>	 (03PS1) 10Michael Große: createAccount: Log exposure and CTRs for account creation experiment [extensions/WikimediaEvents] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1255685 (https://phabricator.wikimedia.org/T419916)
[11:45:02] <wikibugs>	 (03PS1) 10Michael Große: CreateAccount: Add class to aide in instrumentation [extensions/GrowthExperiments] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1255686
[11:45:08] <wikibugs>	 (03CR) 10CI reject: [V:04-1] createAccount: Log exposure and CTRs for account creation experiment [extensions/WikimediaEvents] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1255685 (https://phabricator.wikimedia.org/T419916) (owner: 10Michael Große)
[11:46:53] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003"
[11:46:58] <wikibugs>	 (03PS5) 10Jcrespo: mediabackups: Open s3 storage port on storage hosts from working hosts [puppet] - 10https://gerrit.wikimedia.org/r/1255003 (https://phabricator.wikimedia.org/T410028)
[11:47:10] <wikibugs>	 (03PS2) 10Brouberol: aux-k8s/kafka-mirrormaker: add main-codfw-to-main-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255660 (https://phabricator.wikimedia.org/T417407)
[11:47:10] <wikibugs>	 (03PS2) 10Brouberol: aux-k8s/kafka-mirrormaker: add main-eqad-to-jumbo-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255661 (https://phabricator.wikimedia.org/T417407)
[11:47:10] <wikibugs>	 (03PS2) 10Brouberol: aux-k8s/kafka-mirrormaker: cleanup helmfile of duplicated namespace definitions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255662 (https://phabricator.wikimedia.org/T417407)
[11:47:20] <wikibugs>	 (03CR) 10Brouberol: aux-k8s/kafka-mirrormaker: add main-codfw-to-main-eqiad (1 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255660 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol)
[11:47:27] <wikibugs>	 (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1255003 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo)
[11:47:45] <wikibugs>	 (03CR) 10CI reject: [V:04-1] mediabackups: Open s3 storage port on storage hosts from working hosts [puppet] - 10https://gerrit.wikimedia.org/r/1255003 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo)
[11:48:13] <wikibugs>	 (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1255657 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol)
[11:48:15] <wikibugs>	 (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1255656 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol)
[11:48:17] <wikibugs>	 (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1255658 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol)
[11:48:18] <wikibugs>	 (03CR) 10Trueg: [C:03+2] wikidata-platform: wdqs-queryhammer helmfile deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251453 (https://phabricator.wikimedia.org/T417415) (owner: 10Trueg)
[11:48:28] <wikibugs>	 (03PS6) 10Jcrespo: mediabackups: Open s3 storage port on storage hosts from working hosts [puppet] - 10https://gerrit.wikimedia.org/r/1255003 (https://phabricator.wikimedia.org/T410028)
[11:48:32] <wikibugs>	 (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1255003 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo)
[11:49:57] <logmsgbot>	 btullis@cumin1003 reimage (PID 342152) is awaiting input
[11:50:08] <wikibugs>	 (Merged) jenkins-bot: wikidata-platform: wdqs-queryhammer helmfile deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1251453 (https://phabricator.wikimedia.org/T417415) (owner: Trueg)
[11:51:48] <icinga-wm>	 PROBLEM - BFD status on lsw1-d2-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:51:48] <wikibugs>	 (PS1) Dreamy Jazz: mw::maintenance: Purge blocks on closed but not preinstall wikis [puppet] - https://gerrit.wikimedia.org/r/1255687 (https://phabricator.wikimedia.org/T420571)
[11:53:06] <Dreamy_Jazz>	 jouncebot: next
[11:53:06] <jouncebot>	 In 0 hour(s) and 6 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260319T1200)
[11:53:37] <moritzm>	 !log upgrade rpki2003 to Routinator 0.15.1 T420572
[11:53:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:53:42] <stashbot>	 T420572: Upgrade Routinator to 0.15.1 - https://phabricator.wikimedia.org/T420572
[11:54:56] <wikibugs>	 (PS2) Dreamy Jazz: mw::maintenance: Purge blocks on closed but not preinstall wikis [puppet] - https://gerrit.wikimedia.org/r/1255687 (https://phabricator.wikimedia.org/T420571)
[11:55:03] <jinxer-wm>	 FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster  - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[11:55:24] <wikibugs>	 (CR) Dreamy Jazz: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1255687 (https://phabricator.wikimedia.org/T420571) (owner: Dreamy Jazz)
[11:55:52] <wikibugs>	 (PS1) JMeybohm: wikikube: Add wikikube-worker[1335-1349].eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/1255689 (https://phabricator.wikimedia.org/T418259)
[11:57:12] <wikibugs>	 (CR) JMeybohm: [C:+2] Remove PSP related code from admin_ng [deployment-charts] - https://gerrit.wikimedia.org/r/1248823 (https://phabricator.wikimedia.org/T273507) (owner: JMeybohm)
[11:57:48] <icinga-wm>	 RECOVERY - BFD status on lsw1-d2-codfw.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:57:55] <logmsgbot>	 !log jiji@cumin1003 START - Cookbook sre.memcached.roll-reboot-restart rolling reboot on A:memcached-codfw
[11:58:21] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.ceph.roll-restart-reboot-server (exit_code=0) rolling reboot on A:cephosd-codfw
[11:59:54] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003"
[11:59:54] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1017.eqiad.wmnet with OS bookworm
[12:00:01] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260319T1200)
[12:00:20] <wikibugs>	 (CR) Michael Große: "recheck" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.20) - https://gerrit.wikimedia.org/r/1255685 (https://phabricator.wikimedia.org/T419916) (owner: Michael Große)
[12:00:44] <wikibugs>	 (PS1) Muehlenhoff: versitygw: Don't set file ownership for root:root [puppet] - https://gerrit.wikimedia.org/r/1255690
[12:00:47] <wikibugs>	 (CR) Clément Goubert: [C:+1] wikikube: Add wikikube-worker[1335-1349].eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/1255689 (https://phabricator.wikimedia.org/T418259) (owner: JMeybohm)
[12:03:40] <wikibugs>	 SRE, Data-Persistence, Data-Persistence-Backup, Infrastructure Security, and 2 others: Unexpected media growth led to low disk resources on several media backup hosts - https://phabricator.wikimedia.org/T410028#11727165 (MoritzMuehlenhoff) >>! In T410028#11714176, @jcrespo wrote: > For this, the...
[12:03:41] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:04:25] <wikibugs>	 (PS1) Btullis: Put dse-k8s-worker101[6-7] back into service [puppet] - https://gerrit.wikimedia.org/r/1255692 (https://phabricator.wikimedia.org/T414787)
[12:04:57] <wikibugs>	 (Merged) jenkins-bot: Remove PSP related code from admin_ng [deployment-charts] - https://gerrit.wikimedia.org/r/1248823 (https://phabricator.wikimedia.org/T273507) (owner: JMeybohm)
[12:07:25] <wikibugs>	 (PS1) Dreamy Jazz: mw::maintenance: Run purgeRecentChanges.php on wikis without CheckUser [puppet] - https://gerrit.wikimedia.org/r/1255694 (https://phabricator.wikimedia.org/T420062)
[12:07:43] <wikibugs>	 (CR) Jcrespo: [C:+2] versitygw: Don't set file ownership for root:root [puppet] - https://gerrit.wikimedia.org/r/1255690 (owner: Muehlenhoff)
[12:08:19] <wikibugs>	 (PS2) Dreamy Jazz: mw::maintenance: Run purgeRecentChanges.php on wikis without CheckUser [puppet] - https://gerrit.wikimedia.org/r/1255694 (https://phabricator.wikimedia.org/T420062)
[12:09:24] <wikibugs>	 (CR) Btullis: [C:+2] Put dse-k8s-worker101[6-7] back into service [puppet] - https://gerrit.wikimedia.org/r/1255692 (https://phabricator.wikimedia.org/T414787) (owner: Btullis)
[12:10:03] <jinxer-wm>	 RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster  - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[12:10:06] <logmsgbot>	 !log urbanecm@deploy2002 mwscript-k8s job started: GrowthExperiments:reassignMentees --wiki=enwiki --mentor=Bilorv --performer=Bilorv --as-job  # T418194
[12:10:10] <stashbot>	 T418194: Mentors still having mentees after removing themselves - https://phabricator.wikimedia.org/T418194
[12:10:30] <wikibugs>	 (CR) Dreamy Jazz: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1255694 (https://phabricator.wikimedia.org/T420062) (owner: Dreamy Jazz)
[12:10:57] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 19 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.20) - https://gerrit.wikimedia.org/r/1255686 (owner: Michael Große)
[12:11:24] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 19 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.20) - https://gerrit.wikimedia.org/r/1255685 (https://phabricator.wikimedia.org/T419916) (owner: Michael Große)
[12:12:02] <wikibugs>	 (PS1) Muehlenhoff: Switch backup1015 to nftables [puppet] - https://gerrit.wikimedia.org/r/1255697 (https://phabricator.wikimedia.org/T410028)
[12:15:03] <jinxer-wm>	 FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster  - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[12:17:20] <wikibugs>	 (CR) Kamila Součková: [C:+1] rest-gateway: Add api.w.o device-analytics support [deployment-charts] - https://gerrit.wikimedia.org/r/1255590 (https://phabricator.wikimedia.org/T418147) (owner: Clément Goubert)
[12:21:16] <wikibugs>	 (PS1) Jcrespo: mediabackup: Switch backup new media storage hosts to nftables [puppet] - https://gerrit.wikimedia.org/r/1255704 (https://phabricator.wikimedia.org/T410028)
[12:22:01] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1255704 (https://phabricator.wikimedia.org/T410028) (owner: Jcrespo)
[12:22:15] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[12:22:55] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[12:23:11] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[12:23:18] <wikibugs>	 (PS1) PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1255708
[12:24:19] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[12:25:02] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[12:25:40] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[12:27:46] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[12:27:55] <wikibugs>	 (CR) Clément Goubert: [C:+1] "LGTM" [deployment-charts] - https://gerrit.wikimedia.org/r/1251475 (https://phabricator.wikimedia.org/T419548) (owner: Kamila Součková)
[12:28:19] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good, one nit inline" [puppet] - https://gerrit.wikimedia.org/r/1255003 (https://phabricator.wikimedia.org/T410028) (owner: Jcrespo)
[12:28:22] <wikibugs>	 (CR) Jcrespo: [C:+2] mediabackup: Switch backup new media storage hosts to nftables [puppet] - https://gerrit.wikimedia.org/r/1255704 (https://phabricator.wikimedia.org/T410028) (owner: Jcrespo)
[12:29:34] <wikibugs>	 (CR) JMeybohm: [C:+2] Fix PodSecurityPolicy related comments [puppet] - https://gerrit.wikimedia.org/r/1250524 (https://phabricator.wikimedia.org/T273507) (owner: JMeybohm)
[12:29:38] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[12:29:41] <wikibugs>	 (PS2) Clément Goubert: rest-gateway: Add linkrecommendation support [deployment-charts] - https://gerrit.wikimedia.org/r/1255581 (https://phabricator.wikimedia.org/T418148)
[12:31:05] <wikibugs>	 (PS7) Jcrespo: mediabackups: Open s3 storage port on storage hosts from working hosts [puppet] - https://gerrit.wikimedia.org/r/1255003 (https://phabricator.wikimedia.org/T410028)
[12:31:13] <wikibugs>	 (CR) Kamila Součková: [C:+2] shellbox: Setup shellbox-icu72 [deployment-charts] - https://gerrit.wikimedia.org/r/1251475 (https://phabricator.wikimedia.org/T419548) (owner: Kamila Součková)
[12:31:14] <wikibugs>	 (CR) Jcrespo: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1255003 (https://phabricator.wikimedia.org/T410028) (owner: Jcrespo)
[12:31:54] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1255003 (https://phabricator.wikimedia.org/T410028) (owner: Jcrespo)
[12:33:10] <jayme>	 jynus: good to merge "mediabackup: Switch backup new media storage hosts to nftables" ?
[12:33:22] <jynus>	 yes
[12:33:29] <jynus>	 sorry, too many things ongoing
[12:33:34] <jayme>	 np, done
[12:33:41] <jynus>	 I was about to
[12:33:52] <wikibugs>	 (Merged) jenkins-bot: shellbox: Setup shellbox-icu72 [deployment-charts] - https://gerrit.wikimedia.org/r/1251475 (https://phabricator.wikimedia.org/T419548) (owner: Kamila Součková)
[12:34:10] <wikibugs>	 (Abandoned) Muehlenhoff: Switch backup1015 to nftables [puppet] - https://gerrit.wikimedia.org/r/1255697 (https://phabricator.wikimedia.org/T410028) (owner: Muehlenhoff)
[12:34:30] <wikibugs>	 (CR) Jcrespo: "Thank you so much Moritz for your help!" [puppet] - https://gerrit.wikimedia.org/r/1255697 (https://phabricator.wikimedia.org/T410028) (owner: Muehlenhoff)
[12:36:47] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host krb2002.codfw.wmnet
[12:37:20] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host doh4003.wikimedia.org
[12:37:21] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/shellbox: apply
[12:37:22] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[12:37:51] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox: apply
[12:38:31] <logmsgbot>	 !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox: apply
[12:39:12] <logmsgbot>	 !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply
[12:39:54] <logmsgbot>	 !log kamila@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox: apply
[12:40:28] <logmsgbot>	 !log kamila@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox: apply
[12:40:46] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply
[12:40:59] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply
[12:41:12] <logmsgbot>	 !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply
[12:41:13] <wikibugs>	 (PS13) Jcrespo: mediabackup: Prepare mediabackup worker profile for new storage backend [puppet] - https://gerrit.wikimedia.org/r/1254906 (https://phabricator.wikimedia.org/T410028)
[12:41:23] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts testvm7001.magru.wmnet
[12:41:39] <logmsgbot>	 !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply
[12:41:56] <wikibugs>	 (PS14) Jcrespo: mediabackup: Prepare mediabackup worker profile for new storage backend [puppet] - https://gerrit.wikimedia.org/r/1254906 (https://phabricator.wikimedia.org/T410028)
[12:42:15] <wikibugs>	 (CR) Jcrespo: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1254906 (https://phabricator.wikimedia.org/T410028) (owner: Jcrespo)
[12:42:28] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Data-Platform-SRE (2026-03-06 - 2026-03-27), Essential-Work: Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11727260 (brouberol) {F73147012} We can see that sockets are no longer leaking after the NIC replace...
[12:42:30] <wikibugs>	 (PS1) Muehlenhoff: Remove testvm7001 [puppet] - https://gerrit.wikimedia.org/r/1255718 (https://phabricator.wikimedia.org/T396864)
[12:42:39] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Data-Platform-SRE (2026-03-06 - 2026-03-27), Essential-Work: Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11727262 (brouberol) Open→Resolved a:brouberol
[12:42:41] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host krb2002.codfw.wmnet
[12:43:21] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Data-Platform-SRE (2026-03-06 - 2026-03-27), Essential-Work: Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11727266 (brouberol) a:brouberol→BTullis
[12:43:28] <logmsgbot>	 !log kamila@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply
[12:43:40] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh4003.wikimedia.org - jmm@cumin2002"
[12:43:58] <logmsgbot>	 !log kamila@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply
[12:44:05] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-media: apply
[12:44:18] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply
[12:44:27] <logmsgbot>	 !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply
[12:44:41] <logmsgbot>	 !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply
[12:44:49] <logmsgbot>	 !log kamila@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-media: apply
[12:45:03] <logmsgbot>	 !log kamila@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply
[12:45:11] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply
[12:45:26] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply
[12:45:43] <logmsgbot>	 !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply
[12:46:04] <logmsgbot>	 !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply
[12:46:17] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[12:46:25] <logmsgbot>	 !log jiji@cumin1003 END (PASS) - Cookbook sre.memcached.roll-reboot-restart (exit_code=0) rolling reboot on A:memcached-codfw
[12:46:45] <logmsgbot>	 jmm@cumin2002 makevm (PID 4054560) is awaiting input
[12:46:57] <logmsgbot>	 !log kamila@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply
[12:47:13] <logmsgbot>	 !log kamila@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply
[12:47:34] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1016.eqiad.wmnet
[12:47:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:48:09] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply
[12:48:28] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply
[12:49:41] <logmsgbot>	 !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply
[12:50:04] <logmsgbot>	 !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply
[12:50:10] <logmsgbot>	 !log kamila@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply
[12:50:24] <wikibugs>	 (CR) Jcrespo: [C:+2] mediabackup: Prepare mediabackup worker profile for new storage backend [puppet] - https://gerrit.wikimedia.org/r/1254906 (https://phabricator.wikimedia.org/T410028) (owner: Jcrespo)
[12:50:32] <logmsgbot>	 !log kamila@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply
[12:50:47] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-video: apply
[12:51:17] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply
[12:51:42] <logmsgbot>	 !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply
[12:52:00] <logmsgbot>	 jmm@cumin2002 decommission (PID 4055145) is awaiting input
[12:52:39] <logmsgbot>	 !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply
[12:52:54] <logmsgbot>	 !log kamila@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-video: apply
[12:53:27] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1016.eqiad.wmnet
[12:53:58] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1017.eqiad.wmnet
[12:54:40] <logmsgbot>	 !log kamila@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply
[12:57:52] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh4003.wikimedia.org - jmm@cumin2002"
[12:57:52] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:57:53] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache doh4003.wikimedia.org on all recursors
[12:57:56] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh4003.wikimedia.org on all recursors
[12:58:32] <wikibugs>	 (PS1) Cathal Mooney: Nokia: fix bug in how DHCP relay config was generated [homer/public] - https://gerrit.wikimedia.org/r/1255722 (https://phabricator.wikimedia.org/T371088)
[12:59:22] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: testvm7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[12:59:28] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: testvm7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[12:59:28] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:59:29] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts testvm7001.magru.wmnet
[12:59:53] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1017.eqiad.wmnet
[13:00:00] <wikibugs>	 (CR) CI reject: [V:-1] Nokia: fix bug in how DHCP relay config was generated [homer/public] - https://gerrit.wikimedia.org/r/1255722 (https://phabricator.wikimedia.org/T371088) (owner: Cathal Mooney)
[13:00:03] <jinxer-wm>	 RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster  - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[13:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor I � Unicode. All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260319T1300).
[13:00:05] <jouncebot>	 MichaelG_WMF: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:06] <wikibugs>	 (PS1) Jcrespo: mediabackup: Skip references to Debian package versitygw until available [puppet] - https://gerrit.wikimedia.org/r/1255723 (https://phabricator.wikimedia.org/T410028)
[13:00:26] <MichaelG_WMF>	 Hey 👋
[13:00:41] <urbanecm>	 MichaelG_WMF: i can deploy today
[13:00:46] <wikibugs>	 (PS2) Jcrespo: mediabackup: Skip references to Debian package versitygw until available [puppet] - https://gerrit.wikimedia.org/r/1255723 (https://phabricator.wikimedia.org/T410028)
[13:00:50] <wikibugs>	 (CR) Jcrespo: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1255723 (https://phabricator.wikimedia.org/T410028) (owner: Jcrespo)
[13:00:57] <wikibugs>	 (CR) Tchanders: [C:+1] mw::maintenance: Purge blocks on closed but not preinstall wikis [puppet] - https://gerrit.wikimedia.org/r/1255687 (https://phabricator.wikimedia.org/T420571) (owner: Dreamy Jazz)
[13:00:58] <MichaelG_WMF>	 Thanks urbanecm :)
[13:01:05] <wikibugs>	 (CR) STran: [C:+1] mw::maintenance: Purge blocks on closed but not preinstall wikis [puppet] - https://gerrit.wikimedia.org/r/1255687 (https://phabricator.wikimedia.org/T420571) (owner: Dreamy Jazz)
[13:01:17] <urbanecm>	 MichaelG_WMF: any objections if i deploy both patches at the same time?
[13:01:31] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1255723 (https://phabricator.wikimedia.org/T410028) (owner: Jcrespo)
[13:01:36] <MichaelG_WMF>	 nope, that makes sense
[13:01:50] <wikibugs>	 (CR) Urbanecm: [C:+2] CreateAccount: Add class to aide in instrumentation [extensions/GrowthExperiments] (wmf/1.46.0-wmf.20) - https://gerrit.wikimedia.org/r/1255686 (owner: Michael Große)
[13:01:51] <wikibugs>	 (CR) Urbanecm: [C:+2] createAccount: Log exposure and CTRs for account creation experiment [extensions/WikimediaEvents] (wmf/1.46.0-wmf.20) - https://gerrit.wikimedia.org/r/1255685 (https://phabricator.wikimedia.org/T419916) (owner: Michael Große)
[13:01:52] <moritzm>	 !log installing rsync security updates
[13:01:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:02:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:04:12] <wikibugs>	 (Merged) jenkins-bot: CreateAccount: Add class to aide in instrumentation [extensions/GrowthExperiments] (wmf/1.46.0-wmf.20) - https://gerrit.wikimedia.org/r/1255686 (owner: Michael Große)
[13:04:13] <wikibugs>	 (PS2) Cathal Mooney: Nokia: fix bug in how DHCP relay config was generated [homer/public] - https://gerrit.wikimedia.org/r/1255722 (https://phabricator.wikimedia.org/T371088)
[13:04:13] <wikibugs>	 (CR) CI reject: [V:-1] createAccount: Log exposure and CTRs for account creation experiment [extensions/WikimediaEvents] (wmf/1.46.0-wmf.20) - https://gerrit.wikimedia.org/r/1255685 (https://phabricator.wikimedia.org/T419916) (owner: Michael Große)
[13:04:16] <urbanecm>	 MichaelG_WMF: CI dislikes the WikimediaEvents patch :/
[13:04:39] <urbanecm>	 14:03:32   stderr: 'fatal: unable to access 'https://gerrit.wikimedia.org/r/mediawiki/vendor/': GnuTLS recv error (-54): Error in the pull function.'
[13:04:43] <urbanecm>	 seems unrelated...
[13:04:45] <wikibugs>	 (CR) Jcrespo: [C:+2] mediabackup: Skip references to Debian package versitygw until available [puppet] - https://gerrit.wikimedia.org/r/1255723 (https://phabricator.wikimedia.org/T410028) (owner: Jcrespo)
[13:04:51] <wikibugs>	 (CR) Urbanecm: [C:+2] "..." [extensions/WikimediaEvents] (wmf/1.46.0-wmf.20) - https://gerrit.wikimedia.org/r/1255685 (https://phabricator.wikimedia.org/T419916) (owner: Michael Große)
[13:05:02] <wikibugs>	 (PS3) Clément Goubert: rest-gateway: Add linkrecommendation support [deployment-charts] - https://gerrit.wikimedia.org/r/1255581 (https://phabricator.wikimedia.org/T418148)
[13:05:02] <wikibugs>	 (PS1) Clément Goubert: rest-gateway: Add core API support [deployment-charts] - https://gerrit.wikimedia.org/r/1255725 (https://phabricator.wikimedia.org/T418146)
[13:05:04] <MichaelG_WMF>	 yeah, that would be strange. it was fine in test just moments ago
[13:05:36] <urbanecm>	 rerunning
[13:05:36] <wikibugs>	 (PS3) Cathal Mooney: Nokia: fix bug in how DHCP relay config was generated [homer/public] - https://gerrit.wikimedia.org/r/1255722 (https://phabricator.wikimedia.org/T371088)
[13:07:14] <logmsgbot>	 !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 17 hosts with reason: upgrade
[13:07:19] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.20) - https://gerrit.wikimedia.org/r/1255685 (https://phabricator.wikimedia.org/T419916) (owner: Michael Große)
[13:08:07] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to <ENTER RESOURCE NAME> for <ENTER YOUR USERNAME> - https://phabricator.wikimedia.org/T420578 (CorinnaHillebrand_WMDE) NEW
[13:08:25] <wikibugs>	 (CR) Muehlenhoff: [C:+2] proton: Bump image [deployment-charts] - https://gerrit.wikimedia.org/r/1254867 (owner: Muehlenhoff)
[13:09:13] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh4003.wikimedia.org - jmm@cumin2002"
[13:09:19] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh4003.wikimedia.org - jmm@cumin2002"
[13:09:22] <wikibugs>	 (Merged) jenkins-bot: createAccount: Log exposure and CTRs for account creation experiment [extensions/WikimediaEvents] (wmf/1.46.0-wmf.20) - https://gerrit.wikimedia.org/r/1255685 (https://phabricator.wikimedia.org/T419916) (owner: Michael Große)
[13:09:37] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to <ENTER RESOURCE NAME> for <ENTER YOUR USERNAME> - https://phabricator.wikimedia.org/T420578#11727365 (CorinnaHillebrand_WMDE) @Hany.elmokadem as soon as you're back, could you give me your approval as my manager here?
[13:09:50] <logmsgbot>	 !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1255686|CreateAccount: Add class to aide in instrumentation]], [[gerrit:1255685|createAccount: Log exposure and CTRs for account creation experiment (T419916)]]
[13:09:53] <stashbot>	 T419916: [V1 experiment release] Redesign mobile web account creation form following Codex guidelines - https://phabricator.wikimedia.org/T419916
[13:12:20] <logmsgbot>	 jmm@cumin2002 makevm (PID 4054560) is awaiting input
[13:12:27] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host doh4003.wikimedia.org with OS bookworm
[13:13:24] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Remove testvm7001 [puppet] - https://gerrit.wikimedia.org/r/1255718 (https://phabricator.wikimedia.org/T396864) (owner: Muehlenhoff)
[13:13:41] <logmsgbot>	 !log urbanecm@deploy2002 migr, urbanecm: Backport for [[gerrit:1255686|CreateAccount: Add class to aide in instrumentation]], [[gerrit:1255685|createAccount: Log exposure and CTRs for account creation experiment (T419916)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:14:02] * MichaelG_WMF is looking
[13:14:20] <urbanecm>	 ty
[13:15:22] <MichaelG_WMF>	 urbanecm: looks good 👍
[13:15:40] <logmsgbot>	 !log urbanecm@deploy2002 migr, urbanecm: Continuing with sync
[13:15:44] <urbanecm>	 proceeding
[13:22:14] <moritzm>	 !log upgrade rpki1001 to Routinator 0.15.1 T420572
[13:22:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:22:19] <stashbot>	 T420572: Upgrade Routinator to 0.15.1 - https://phabricator.wikimedia.org/T420572
[13:22:44] <wikibugs>	 (03CR) 10Ssingh: [V:03+1 C:03+2] P:dns::auth: update check for authdns_update_run [puppet] - 10https://gerrit.wikimedia.org/r/1255038 (owner: 10Ssingh)
[13:22:48] <logmsgbot>	 !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1255686|CreateAccount: Add class to aide in instrumentation]], [[gerrit:1255685|createAccount: Log exposure and CTRs for account creation experiment (T419916)]] (duration: 12m 58s)
[13:22:52] <stashbot>	 T419916: [V1 experiment release] Redesign mobile web account creation form following Codex guidelines - https://phabricator.wikimedia.org/T419916
[13:23:13] <urbanecm>	 MichaelG_WMF: done
[13:23:34] <MichaelG_WMF>	 urbanecm: Thank you!
[13:28:41] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job routinator in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:29:30] <wikibugs>	 (03PS3) 10Ssingh: Remove support for enabling Bird 2.18 selectively [puppet] - 10https://gerrit.wikimedia.org/r/1248385 (https://phabricator.wikimedia.org/T413740) (owner: 10Muehlenhoff)
[13:31:40] <phuedx>	 If there's room in the window, I've got a backport to do
[13:31:44] <phuedx>	 Just waiting on CI
[13:32:32] <urbanecm>	 phuedx: go ahead
[13:32:49] <phuedx>	 Ta
[13:33:28] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on doh4003.wikimedia.org with reason: host reimage
[13:37:27] <wikibugs>	 (03CR) 10Eevans: [C:03+2] cassandra-http-gateway: new chart based on aqs-http-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250649 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans)
[13:37:29] <wikibugs>	 (03PS4) 10Jforrester: Expose new wikifunctions.v0 REST API module on Wikifunctions.org only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250107 (https://phabricator.wikimedia.org/T419053)
[13:38:04] <James_F>	 phuedx: When you're done, please shout.
[13:38:33] <phuedx>	 James_F: Go for it. I'm fighting with CI at this point
[13:38:37] <James_F>	 Ack.
[13:38:41] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job routinator in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:38:46] <wikibugs>	 (03Abandoned) 10Jcrespo: Revert^4 "garage: Add a first role and profile" [puppet] - 10https://gerrit.wikimedia.org/r/1212080 (owner: 10Jcrespo)
[13:38:56] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh4003.wikimedia.org with reason: host reimage
[13:39:05] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250107 (https://phabricator.wikimedia.org/T419053) (owner: 10Jforrester)
[13:39:19] <wikibugs>	 (03Merged) 10jenkins-bot: cassandra-http-gateway: new chart based on aqs-http-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250649 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans)
[13:39:27] <wikibugs>	 (03PS2) 10Clément Goubert: rest-gateway: Add core API support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255725 (https://phabricator.wikimedia.org/T418146)
[13:39:55] <wikibugs>	 (03PS1) 10Daniel Kinzler: api-gateway: add Lua hooks mechanism for rest_gateway_routes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255731
[13:39:55] <wikibugs>	 (03CR) 10Eevans: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250650 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans)
[13:40:06] <wikibugs>	 (03CR) 10CI reject: [V:04-1] charts/cassandra-http-gateway: template table configuration for hoarde [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250650 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans)
[13:40:20] <wikibugs>	 (03Merged) 10jenkins-bot: Expose new wikifunctions.v0 REST API module on Wikifunctions.org only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250107 (https://phabricator.wikimedia.org/T419053) (owner: 10Jforrester)
[13:40:40] <logmsgbot>	 !log jforrester@deploy2002 Started scap sync-world: Backport for [[gerrit:1250107|Expose new wikifunctions.v0 REST API module on Wikifunctions.org only (T419053)]]
[13:40:45] <stashbot>	 T419053: Add REST module for Wikifunctions - https://phabricator.wikimedia.org/T419053
[13:40:51] <wikibugs>	 (03PS1) 10Jcrespo: mediabackup: Switch backup media worker hosts to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1255732 (https://phabricator.wikimedia.org/T410028)
[13:40:55] <wikibugs>	 (03PS6) 10Eevans: charts/cassandra-http-gateway: template table configuration for hoarde [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250650 (https://phabricator.wikimedia.org/T414112)
[13:40:55] <wikibugs>	 (03PS7) 10Eevans: services: add linked-artifacts service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250651 (https://phabricator.wikimedia.org/T414112)
[13:41:29] <wikibugs>	 (03PS2) 10Jcrespo: mediabackup: Switch backup media worker hosts to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1255732 (https://phabricator.wikimedia.org/T410028)
[13:41:32] <wikibugs>	 (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1255732 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo)
[13:42:32] <logmsgbot>	 !log jforrester@deploy2002 jforrester: Backport for [[gerrit:1250107|Expose new wikifunctions.v0 REST API module on Wikifunctions.org only (T419053)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:42:51] <logmsgbot>	 !log jforrester@deploy2002 jforrester: Continuing with sync
[13:43:18] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to SQL Lab for cohi - https://phabricator.wikimedia.org/T420578#11727544 (10CorinnaHillebrand_WMDE)
[13:46:36] <James_F>	 phuedx: Over to you; good luck with CI wrestling.
[13:46:43] <logmsgbot>	 !log jforrester@deploy2002 Finished scap sync-world: Backport for [[gerrit:1250107|Expose new wikifunctions.v0 REST API module on Wikifunctions.org only (T419053)]] (duration: 06m 03s)
[13:46:47] <stashbot>	 T419053: Add REST module for Wikifunctions - https://phabricator.wikimedia.org/T419053
[13:46:49] <phuedx>	 Many thanks <3
[13:47:13] <wikibugs>	 (03CR) 10Jcrespo: [C:03+2] mediabackup: Switch backup media worker hosts to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1255732 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo)
[13:47:46] <wikibugs>	 (03CR) 10Fabfur: [C:03+2] haproxy: test haproxy32 on cp2041 [puppet] - 10https://gerrit.wikimedia.org/r/1254195 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur)
[13:48:21] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[13:48:58] <wikibugs>	 (03PS4) 10Cathal Mooney: Nokia: fix bug in how DHCP relay config was generated [homer/public] - 10https://gerrit.wikimedia.org/r/1255722 (https://phabricator.wikimedia.org/T371088)
[13:49:22] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[13:49:22] <wikibugs>	 (03PS1) 10Jforrester: Move testwiki-only Attribution REST API definition to IS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250113
[13:49:22] <wikibugs>	 (03CR) 10Jforrester: "Hey @aschulz@wikimedia.org, I used this nicer style for wiki-specific config for the Wikifunctions API config and it works well. I've made" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250113 (owner: 10Jforrester)
[13:49:27] <wikibugs>	 (03PS2) 10Jforrester: Move testwiki-only Attribution REST API definition to IS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250113
[13:50:32] <wikibugs>	 (03PS1) 10Jforrester: Move GrowthExperiments REST API definition to IS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250114
[13:50:55] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Upgrade Routinator to 0.15.1 - https://phabricator.wikimedia.org/T420572#11727581 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff All RPKI hosts are upgraded and Cathal confirmed it's all working fine
[13:52:39] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[13:52:42] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[13:52:56] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[13:52:58] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[13:53:08] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[13:53:12] <wikibugs>	 (03PS1) 10Fabfur: haproxy: fix lua lib version with haproxy 3.2 [puppet] - 10https://gerrit.wikimedia.org/r/1255735 (https://phabricator.wikimedia.org/T419825)
[13:54:29] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh4003.wikimedia.org with OS bookworm
[13:54:29] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh4003.wikimedia.org
[13:55:32] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1255735 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur)
[13:58:11] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] haproxy: fix lua lib version with haproxy 3.2 [puppet] - 10https://gerrit.wikimedia.org/r/1255735 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur)
[13:58:30] <wikibugs>	 (03CR) 10Fabfur: [C:03+2] haproxy: fix lua lib version with haproxy 3.2 [puppet] - 10https://gerrit.wikimedia.org/r/1255735 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur)
[13:58:33] <phuedx>	 jouncebot next
[13:58:33] <jouncebot>	 In 0 hour(s) and 31 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260319T1430)
[13:59:50] <wikibugs>	 (03PS1) 10Kosta Harlan: hcaptcha: Use the global edit key for MobileFrontend edits if present [extensions/ConfirmEdit] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1255736 (https://phabricator.wikimedia.org/T420574)
[14:00:26] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] Nokia: fix bug in how DHCP relay config was generated [homer/public] - 10https://gerrit.wikimedia.org/r/1255722 (https://phabricator.wikimedia.org/T371088) (owner: 10Cathal Mooney)
[14:01:43] <wikibugs>	 (03CR) 10Elukey: [C:03+1] aux-k8s/kafka-mirrormaker: add main-codfw-to-main-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255660 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol)
[14:02:33] <wikibugs>	 (03CR) 10Elukey: [C:03+1] kafka-main-eqiad: disable mirroring to kafka-main-codfw [puppet] - 10https://gerrit.wikimedia.org/r/1255657 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol)
[14:02:49] <wikibugs>	 (03CR) 10Elukey: [C:03+1] kafka-main-codfw: disable mirroring to kafka-main-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1255656 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol)
[14:02:59] <wikibugs>	 (03CR) 10Elukey: [C:03+1] kafka-jumbo-eqiad: disable mirroring from kafka-main-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1255658 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol)
[14:03:12] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[14:04:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host doh4004.wikimedia.org
[14:04:59] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[14:05:45] <wikibugs>	 06SRE, 06Traffic: Anycast ns[01].wikimedia.org for IPv4 - https://phabricator.wikimedia.org/T366193#11727707 (10cmooney) >>! In T366193#11713908, @ssingh wrote: >>> I think we should clean up stuff in the interim though since it will be a while before we can get our hands on the /24. I will need your help with...
[14:06:33] <wikibugs>	 (03PS1) 10Muehlenhoff: Make doh4003/doh4004 new wikidough nodes [puppet] - 10https://gerrit.wikimedia.org/r/1255738 (https://phabricator.wikimedia.org/T418993)
[14:09:31] <jinxer-wm>	 FIRING: ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:10:04] <Daimona>	 ^Yeah FWIW gerrit seems to be struggling
[14:11:26] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh4004.wikimedia.org - jmm@cumin2002"
[14:12:28] <logmsgbot>	 !log jmm@deploy2002 helmfile [staging] START helmfile.d/services/proton: apply
[14:13:14] <arnaudb>	 Daimona: do you have issues with ipv4? the issue seems to only be with ipv6
[14:13:40] <Daimona>	 It seems to be working normally now
[14:13:48] <logmsgbot>	 !log jmm@deploy2002 helmfile [staging] DONE helmfile.d/services/proton: apply
[14:13:54] <Daimona>	 I had like a massive slowdown for ~5 minutes, haven't checked anything tho
[14:14:02] <arnaudb>	 ack
[14:14:30] <logmsgbot>	 jmm@cumin2002 makevm (PID 4074555) is awaiting input
[14:14:31] <jinxer-wm>	 RESOLVED: ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:17:25] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh4004.wikimedia.org - jmm@cumin2002"
[14:17:25] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:17:26] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache doh4004.wikimedia.org on all recursors
[14:17:29] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh4004.wikimedia.org on all recursors
[14:17:51] <wikibugs>	 (03PS1) 10BBlack: Fix Wmf-Uniq Server-Timing header format [puppet] - 10https://gerrit.wikimedia.org/r/1255744 (https://phabricator.wikimedia.org/T420586)
[14:17:59] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh4004.wikimedia.org - jmm@cumin2002"
[14:18:05] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh4004.wikimedia.org - jmm@cumin2002"
[14:18:07] <icinga-wm>	 PROBLEM - Ensure traffic_server is running for instance backend on cp4043 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[14:18:16] <logmsgbot>	 !log jmm@deploy2002 helmfile [codfw] START helmfile.d/services/proton: apply
[14:18:24] <wikibugs>	 (03PS2) 10Brouberol: aux-k8s/kafka-mirrormaker: add main-eqiad-to-main-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255659 (https://phabricator.wikimedia.org/T417407)
[14:18:24] <wikibugs>	 (03PS3) 10Brouberol: aux-k8s/kafka-mirrormaker: add main-codfw-to-main-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255660 (https://phabricator.wikimedia.org/T417407)
[14:18:25] <wikibugs>	 (03PS3) 10Brouberol: aux-k8s/kafka-mirrormaker: add main-eqad-to-jumbo-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255661 (https://phabricator.wikimedia.org/T417407)
[14:18:25] <wikibugs>	 (03PS3) 10Brouberol: aux-k8s/kafka-mirrormaker: cleanup helmfile of duplicated namespace definitions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255662 (https://phabricator.wikimedia.org/T417407)
[14:19:07] <icinga-wm>	 RECOVERY - Ensure traffic_server is running for instance backend on cp4043 is OK: PROCS OK: 1 process with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[14:19:25] <logmsgbot>	 !log jmm@deploy2002 helmfile [codfw] DONE helmfile.d/services/proton: apply
[14:20:10] <logmsgbot>	 !log jmm@deploy2002 helmfile [eqiad] START helmfile.d/services/proton: apply
[14:20:34] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host doh4004.wikimedia.org with OS bookworm
[14:21:29] <icinga-wm>	 PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[14:21:37] <logmsgbot>	 !log jmm@deploy2002 helmfile [eqiad] DONE helmfile.d/services/proton: apply
[14:22:02] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-conf1004.eqiad.wmnet
[14:22:49] <wikibugs>	 (03PS1) 10Fabfur: profile::haproxy: ability to use custom component on Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1255745 (https://phabricator.wikimedia.org/T419825)
[14:23:11] <wikibugs>	 (03CR) 10Ayounsi: [C:03+2] Anycast: prepend once more when peering with the core routers [homer/public] - 10https://gerrit.wikimedia.org/r/1254185 (https://phabricator.wikimedia.org/T420342) (owner: 10Ayounsi)
[14:24:57] <wikibugs>	 (03PS4) 10Brouberol: aux-k8s/kafka-mirrormaker: add main-eqad-to-jumbo-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255661 (https://phabricator.wikimedia.org/T417407)
[14:24:57] <wikibugs>	 (03PS4) 10Brouberol: aux-k8s/kafka-mirrormaker: cleanup helmfile of duplicated namespace definitions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255662 (https://phabricator.wikimedia.org/T417407)
[14:25:17] <logmsgbot>	 !log bking@cumin2002 conftool action : set/pooled=yes:weight=10; selector: name=dse-k8s-worker1012.eqiad.wmnet|dse-k8s-worker1015.eqiad.wmnet|dse-k8s-worker1016.eqiad.wmnet|dse-k8s-worker1017.eqiad.wmnet
[14:26:10] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1255745 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur)
[14:27:25] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-conf1004.eqiad.wmnet
[14:29:17] <wikibugs>	 (03PS8) 10Jcrespo: mediabackup: Deploy new ms-backup hosts on both dcs [puppet] - 10https://gerrit.wikimedia.org/r/1254913 (https://phabricator.wikimedia.org/T420464)
[14:29:29] <wikibugs>	 (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1254913 (https://phabricator.wikimedia.org/T420464) (owner: 10Jcrespo)
[14:29:57] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-conf1005.eqiad.wmnet
[14:30:05] <jouncebot>	 Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260319T1430)
[14:30:44] <wikibugs>	 06SRE-OnFire, 10Cite, 10VisualEditor, 10WMDE-TechWish-Maintenance, and 3 others: Investigation: Write visual editor debug tool to produce Converter test cases - https://phabricator.wikimedia.org/T400311#11727887 (10awight) 05Open→03Resolved a:03awight Great!  We have some more fine-tuning to make...
[14:31:43] <icinga-wm>	 RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 227.20 ms
[14:31:46] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] Nokia: fix bug in how DHCP relay config was generated [homer/public] - 10https://gerrit.wikimedia.org/r/1255722 (https://phabricator.wikimedia.org/T371088) (owner: 10Cathal Mooney)
[14:31:50] <wikibugs>	 (03PS1) 10Phuedx: Hooks: Re-apply I52fc151ab88d79754baeff35d2c0f200ebe9fc9a [extensions/TestKitchen] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1255747
[14:31:55] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[14:31:55] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[14:32:13] <sukhe>	 why is that :P
[14:32:23] <logmsgbot>	 !log bking@cumin2002 conftool action : set/pooled=yes:weight=1; selector: name=dse-k8s-worker1010.eqiad.wmnet|dse-k8s-worker1011.eqiad.wmnet|dse-k8s-worker1012.eqiad.wmnet|dse-k8s-worker1013.eqiad.wmnet|dse-k8s-worker1015.eqiad.wmnet|dse-k8s-worker1016.eqiad.wmnet|dse-k8s-worker1017.eqiad.wmnet|dse-k8s-worker1018.eqiad.wmnet|dse-k8s-worker1019.eqiad.wmnet
[14:33:05] <wikibugs>	 (03Merged) 10jenkins-bot: Nokia: fix bug in how DHCP relay config was generated [homer/public] - 10https://gerrit.wikimedia.org/r/1255722 (https://phabricator.wikimedia.org/T371088) (owner: 10Cathal Mooney)
[14:34:29] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1255736 (https://phabricator.wikimedia.org/T420574) (owner: 10Kosta Harlan)
[14:35:27] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-conf1005.eqiad.wmnet
[14:38:07] <icinga-wm>	 PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[14:38:36] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[14:40:00] <wikibugs>	 (03PS1) 10Cathal Mooney: Nokia: Manually configure the MAC address for anycast gateway ints [homer/public] - 10https://gerrit.wikimedia.org/r/1255749 (https://phabricator.wikimedia.org/T371088)
[14:40:30] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[14:41:10] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Nokia: Manually configure the MAC address for anycast gateway ints [homer/public] - 10https://gerrit.wikimedia.org/r/1255749 (https://phabricator.wikimedia.org/T371088) (owner: 10Cathal Mooney)
[14:41:53] <wikibugs>	 (03PS2) 10BBlack: Fix Wmf-Uniq Server-Timing header format [puppet] - 10https://gerrit.wikimedia.org/r/1255744 (https://phabricator.wikimedia.org/T420586)
[14:42:35] <kostajh>	 jouncebot: nowandnext
[14:42:36] <jouncebot>	 For the next 0 hour(s) and 17 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260319T1430)
[14:42:36] <jouncebot>	 In 0 hour(s) and 17 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260319T1500)
[14:43:10] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on doh4004.wikimedia.org with reason: host reimage
[14:43:15] <wikibugs>	 (03CR) 10BCornwall: [C:03+1] Fix Wmf-Uniq Server-Timing header format [puppet] - 10https://gerrit.wikimedia.org/r/1255744 (https://phabricator.wikimedia.org/T420586) (owner: 10BBlack)
[14:44:27] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to production for Mpostoronca-wmf - https://phabricator.wikimedia.org/T420458#11727944 (10OKryva-WMF) Approved.
[14:46:06] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-conf1006.eqiad.wmnet
[14:46:50] <wikibugs>	 (03CR) 10BCornwall: [V:03+1 C:03+1] "`0 tests failed, 0 tests skipped, 40 tests passed" [puppet] - 10https://gerrit.wikimedia.org/r/1255744 (https://phabricator.wikimedia.org/T420586) (owner: 10BBlack)
[14:48:21] <icinga-wm>	 RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 265.74 ms
[14:49:01] <wikibugs>	 (03CR) 10BCornwall: "+1 in the sense that the code seems sound - no idea about the hostname accuracy." [dns] - 10https://gerrit.wikimedia.org/r/1255669 (https://phabricator.wikimedia.org/T416705) (owner: 10Gerrit maintenance bot)
[14:49:10] <wikibugs>	 (03CR) 10BCornwall: [C:03+1] wmnet: update CNAME records for DB masters for dc switchover [dns] - 10https://gerrit.wikimedia.org/r/1255669 (https://phabricator.wikimedia.org/T416705) (owner: 10Gerrit maintenance bot)
[14:49:34] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh4004.wikimedia.org with reason: host reimage
[14:51:21] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-conf1006.eqiad.wmnet
[14:51:37] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host flink-zk1001.eqiad.wmnet
[14:52:15] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host zookeeper-test1002.eqiad.wmnet
[14:54:36] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host matomo1003.eqiad.wmnet
[14:55:11] <wikibugs>	 (03PS4) 10Fabfur: profile::haproxy: ability to use custom component on Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1255745 (https://phabricator.wikimedia.org/T419825)
[14:55:25] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host flink-zk1001.eqiad.wmnet
[14:55:51] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host flink-zk1002.eqiad.wmnet
[14:56:14] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host zookeeper-test1002.eqiad.wmnet
[14:57:53] <wikibugs>	 (03PS1) 10Brouberol: kafka-mirrormaker: enable JMX metrics collection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255755 (https://phabricator.wikimedia.org/T417407)
[14:58:35] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host matomo1003.eqiad.wmnet
[14:59:37] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host flink-zk1002.eqiad.wmnet
[15:00:01] <wikibugs>	 (03CR) 10Phuedx: Fix Wmf-Uniq Server-Timing header format (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1255744 (https://phabricator.wikimedia.org/T420586) (owner: 10BBlack)
[15:00:04] <jouncebot>	 andre and brennen: #bothumor I � Unicode. All rise for Train log triage deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260319T1500).
[15:01:15] <wikibugs>	 (03CR) 10BCornwall: "Could this all be condensed into something like `unless debian::codename::eq('trixie') and $haproxy_version == 'haproxy30'`?" [puppet] - 10https://gerrit.wikimedia.org/r/1255745 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur)
[15:01:32] <wikibugs>	 (03CR) 10BCornwall: "Marking unresolved" [puppet] - 10https://gerrit.wikimedia.org/r/1255745 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur)
[15:01:55] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host flink-zk1003.eqiad.wmnet
[15:02:13] <wikibugs>	 10ops-magru: Inbound errors on interface cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}) - https://phabricator.wikimedia.org/T415743#11728021 (10RobH) Please note the ticket was opened but their portal doesn't seem to email myself, Arzhel, or Cathal even though I listed all three of us on the tic...
[15:03:45] <jinxer-wm>	 FIRING: [2x] CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater  - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable
[15:05:31] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host flink-zk1003.eqiad.wmnet
[15:06:30] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh4004.wikimedia.org with OS bookworm
[15:06:31] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh4004.wikimedia.org
[15:06:53] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5025.eqsin.wmnet with OS trixie
[15:07:04] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5026.eqsin.wmnet with OS trixie
[15:07:30] <wikibugs>	 (03CR) 10Phuedx: Fix Wmf-Uniq Server-Timing header format (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1255744 (https://phabricator.wikimedia.org/T420586) (owner: 10BBlack)
[15:08:00] <phuedx>	 I'm looking to deploy https://gerrit.wikimedia.org/r/c/mediawiki/extensions/TestKitchen/+/1255747. It should fix the large volume of validation errors on the mediawiki.api-request event stream
[15:08:11] <wikibugs>	 (03CR) 10Elukey: [C:03+1] kafka-mirrormaker: enable JMX metrics collection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255755 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol)
[15:08:41] <phuedx>	 andre, brennen: Would an out of band deployment disrupt you?
[15:08:44] <wikibugs>	 10ops-magru: Inbound errors on interface cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}) - https://phabricator.wikimedia.org/T415743#11728044 (10RobH) Summary: * EdgeUno says they see no errors only our flap * Arzhel replied back stating that we are still seeing errors, stressed that we've alread...
[15:08:45] <jinxer-wm>	 FIRING: [3x] CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater  - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable
[15:09:02] <logmsgbot>	 !log jayme@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'.
[15:09:06] <logmsgbot>	 !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[15:09:07] <andre>	 phuedx: train has reached its final destination and things are calm
[15:09:30] <andre>	 in general, see https://versions.toolforge.org/ for a quick versions check :)
[15:09:42] <logmsgbot>	 !log jayme@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'.
[15:10:17] <logmsgbot>	 !log jayme@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'.
[15:10:36] <phuedx>	 andre: Thanks. OK. Starting
[15:10:56] <logmsgbot>	 !log jayme@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[15:10:59] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by phuedx@deploy2002 using scap backport" [extensions/TestKitchen] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1255747 (owner: 10Phuedx)
[15:11:01] <logmsgbot>	 !log jayme@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'.
[15:12:30] <wikibugs>	 (03CR) 10Dzahn: "does this apply to a service on the aux cluster though?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255514 (https://phabricator.wikimedia.org/T414405) (owner: 10AOkoth)
[15:12:35] <wikibugs>	 (03Merged) 10jenkins-bot: Hooks: Re-apply I52fc151ab88d79754baeff35d2c0f200ebe9fc9a [extensions/TestKitchen] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1255747 (owner: 10Phuedx)
[15:12:57] <logmsgbot>	 !log phuedx@deploy2002 Started scap sync-world: Backport for [[gerrit:1255747|Hooks: Re-apply I52fc151ab88d79754baeff35d2c0f200ebe9fc9a]]
[15:13:45] <jinxer-wm>	 RESOLVED: [3x] CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater  - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable
[15:14:16] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host hcaptcha-proxy4003.wikimedia.org
[15:14:18] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[15:14:46] <wikibugs>	 (PS5) Fabfur: profile::haproxy: ability to use custom component on Trixie [puppet] - https://gerrit.wikimedia.org/r/1255745 (https://phabricator.wikimedia.org/T419825)
[15:14:48] <logmsgbot>	 !log phuedx@deploy2002 phuedx: Backport for [[gerrit:1255747|Hooks: Re-apply I52fc151ab88d79754baeff35d2c0f200ebe9fc9a]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[15:14:55] <wikibugs>	 (CR) Fabfur: "Probably yes, I was thinking about supporting also future versions but let's start easy and do this way instead!" [puppet] - https://gerrit.wikimedia.org/r/1255745 (https://phabricator.wikimedia.org/T419825) (owner: Fabfur)
[15:15:18] <logmsgbot>	 !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[15:15:22] <logmsgbot>	 !log jayme@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[15:15:42] <logmsgbot>	 !log jayme@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[15:15:46] <logmsgbot>	 !log jayme@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
[15:16:42] <logmsgbot>	 !log jayme@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
[15:16:46] <logmsgbot>	 !log jayme@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[15:17:02] <wikibugs>	 (PS2) Cathal Mooney: Nokia: Manually configure the MAC address for anycast gateway ints [homer/public] - https://gerrit.wikimedia.org/r/1255749 (https://phabricator.wikimedia.org/T371088)
[15:17:37] <wikibugs>	 (PS3) Cathal Mooney: Nokia: Manually configure the MAC address for anycast gateway ints [homer/public] - https://gerrit.wikimedia.org/r/1255749 (https://phabricator.wikimedia.org/T371088)
[15:17:39] <logmsgbot>	 !log jayme@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[15:17:43] <logmsgbot>	 !log jayme@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[15:18:40] <logmsgbot>	 !log jayme@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[15:18:40] <phuedx>	 Quick check on enwiki main page looks good and the logs look clean (no warnings or errors in Logstash)
[15:18:44] <logmsgbot>	 !log jayme@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'.
[15:18:49] <wikibugs>	 (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1255745 (https://phabricator.wikimedia.org/T419825) (owner: Fabfur)
[15:18:58] <logmsgbot>	 !log phuedx@deploy2002 phuedx: Continuing with sync
[15:19:38] <wikibugs>	 (CR) BCornwall: [V:+1] "PCC SUCCESS (NOOP 3 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1255745 (https://phabricator.wikimedia.org/T419825) (owner: Fabfur)
[15:19:52] <wikibugs>	 (CR) BCornwall: [C:+1] profile::haproxy: ability to use custom component on Trixie [puppet] - https://gerrit.wikimedia.org/r/1255745 (https://phabricator.wikimedia.org/T419825) (owner: Fabfur)
[15:19:53] <icinga-wm>	 PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[15:19:57] <logmsgbot>	 jmm@cumin2002 makevm (PID 4091383) is awaiting input
[15:21:44] <logmsgbot>	 !log jayme@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'.
[15:21:48] <logmsgbot>	 !log jayme@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
[15:21:49] <wikibugs>	 (CR) Ssingh: [C:+1] Make doh4003/doh4004 new wikidough nodes [puppet] - https://gerrit.wikimedia.org/r/1255738 (https://phabricator.wikimedia.org/T418993) (owner: Muehlenhoff)
[15:21:57] <logmsgbot>	 !log jayme@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[15:22:00] <logmsgbot>	 !log jayme@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'.
[15:22:17] <logmsgbot>	 !log jayme@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'.
[15:22:48] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy4003.wikimedia.org - jmm@cumin2002"
[15:22:52] <logmsgbot>	 !log phuedx@deploy2002 Finished scap sync-world: Backport for [[gerrit:1255747|Hooks: Re-apply I52fc151ab88d79754baeff35d2c0f200ebe9fc9a]] (duration: 09m 55s)
[15:24:55] <icinga-wm>	 RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 222.72 ms
[15:25:10] <wikibugs>	 (CR) Jcrespo: [C:+2] mediabackup: Deploy new ms-backup hosts on both dcs [puppet] - https://gerrit.wikimedia.org/r/1254913 (https://phabricator.wikimedia.org/T420464) (owner: Jcrespo)
[15:25:24] <phuedx>	 Monitoring logs
[15:25:49] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host acmechief-test1001.eqiad.wmnet
[15:25:54] <logmsgbot>	 jmm@cumin2002 makevm (PID 4091383) is awaiting input
[15:26:17] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host acmechief-test2001.codfw.wmnet
[15:28:56] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host acmechief1002.eqiad.wmnet
[15:29:28] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host acmechief-test1001.eqiad.wmnet
[15:30:01] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[15:30:07] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host acmechief-test2001.codfw.wmnet
[15:31:05] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy4003.wikimedia.org - jmm@cumin2002"
[15:31:05] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:31:06] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache hcaptcha-proxy4003.wikimedia.org on all recursors
[15:31:09] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) hcaptcha-proxy4003.wikimedia.org on all recursors
[15:31:41] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy4003.wikimedia.org - jmm@cumin2002"
[15:31:47] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy4003.wikimedia.org - jmm@cumin2002"
[15:32:27] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy4003.wikimedia.org with OS bookworm
[15:32:41] <wikibugs>	 (CR) Ssingh: "(Still on the list to review, not forgotten about this.)" [puppet] - https://gerrit.wikimedia.org/r/1250626 (owner: Majavah)
[15:32:47] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host acmechief1002.eqiad.wmnet
[15:32:54] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host acmechief2002.codfw.wmnet
[15:33:06] <wikibugs>	 (CR) BBlack: Fix Wmf-Uniq Server-Timing header format (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1255744 (https://phabricator.wikimedia.org/T420586) (owner: BBlack)
[15:33:43] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s3 on clouddb1013 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:33:43] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: s3 on clouddb1013 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:34:29] <logmsgbot>	 !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp5026.eqsin.wmnet with OS trixie
[15:34:39] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s3 on clouddb1013 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 550.25 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:34:46] <logmsgbot>	 !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp5025.eqsin.wmnet with OS trixie
[15:34:52] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5026.eqsin.wmnet with OS trixie
[15:35:06] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5025.eqsin.wmnet with OS trixie
[15:35:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:35:43] <wikibugs>	 (PS4) Ssingh: Remove support for enabling Bird 2.18 selectively [puppet] - https://gerrit.wikimedia.org/r/1248385 (https://phabricator.wikimedia.org/T413740) (owner: Muehlenhoff)
[15:35:57] <phuedx>	 Logs look clean and the validation errors have disappeared 👍
[15:36:16] <wikibugs>	 (CR) CI reject: [V:-1] Remove support for enabling Bird 2.18 selectively [puppet] - https://gerrit.wikimedia.org/r/1248385 (https://phabricator.wikimedia.org/T413740) (owner: Muehlenhoff)
[15:36:44] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host acmechief2002.codfw.wmnet
[15:36:55] <wikibugs>	 (PS5) Ssingh: Remove support for enabling Bird 2.18 selectively [puppet] - https://gerrit.wikimedia.org/r/1248385 (https://phabricator.wikimedia.org/T413740) (owner: Muehlenhoff)
[15:40:04] <wikibugs>	 (PS3) Dzahn: jenkins: allow rsyncing of data for migrating a jenkins server [puppet] - https://gerrit.wikimedia.org/r/1255136 (https://phabricator.wikimedia.org/T418521)
[15:42:44] <wikibugs>	 (CR) Ssingh: [V:+1] "PCC SUCCESS (CORE_DIFF 9 DIFF 1 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compile" [puppet] - https://gerrit.wikimedia.org/r/1248385 (https://phabricator.wikimedia.org/T413740) (owner: Muehlenhoff)
[15:43:44] <wikibugs>	 (CR) Fabfur: [C:+2] profile::haproxy: ability to use custom component on Trixie [puppet] - https://gerrit.wikimedia.org/r/1255745 (https://phabricator.wikimedia.org/T419825) (owner: Fabfur)
[15:43:45] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to SQL Lab for cohi - https://phabricator.wikimedia.org/T420578#11728322 (Aklapper) @CorinnaHillebrand_WMDE: Please also [link your LDAP account to your Phabricator account](https://phabricator.wikimedia.org/settings/panel/external/), so your 'LDAP User' accoun...
[15:47:11] <wikibugs>	 (CR) Dzahn: [V:+1 C:+2] "https://puppet-compiler.wmflabs.org/output/1255136/8308/" [puppet] - https://gerrit.wikimedia.org/r/1255136 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn)
[15:47:45] <wikibugs>	 (PS1) Milimetric: testKitchen: Add custom stream name [mediawiki-config] - https://gerrit.wikimedia.org/r/1255763 (https://phabricator.wikimedia.org/T417050)
[15:48:09] <wikibugs>	 (CR) Phuedx: Fix Wmf-Uniq Server-Timing header format (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1255744 (https://phabricator.wikimedia.org/T420586) (owner: BBlack)
[15:48:18] <icinga-wm>	 PROBLEM - Host ms-backup2003 is DOWN: PING CRITICAL - Packet loss = 100%
[15:48:18] <icinga-wm>	 PROBLEM - Host ms-backup2004 is DOWN: PING CRITICAL - Packet loss = 100%
[15:48:50] <icinga-wm>	 RECOVERY - Host ms-backup2003 is UP: PING OK - Packet loss = 0%, RTA = 30.46 ms
[15:49:08] <wikibugs>	 (CR) Ssingh: [V:+1] "@mmuhlenhoff@wikimedia.org: I rebased this on master and wanted to merge this today. Can you please quickly review it again? Thanks." [puppet] - https://gerrit.wikimedia.org/r/1248385 (https://phabricator.wikimedia.org/T413740) (owner: Muehlenhoff)
[15:49:22] <icinga-wm>	 RECOVERY - Host ms-backup2004 is UP: PING OK - Packet loss = 0%, RTA = 30.41 ms
[15:51:21] <wikibugs>	 (CR) TChin: [C:+1] testKitchen: Add custom stream name [mediawiki-config] - https://gerrit.wikimedia.org/r/1255763 (https://phabricator.wikimedia.org/T417050) (owner: Milimetric)
[15:51:33] <wikibugs>	 (PS1) Dzahn: Revert "jenkins: define contint1003 as the manager_host for the jenkins role" [puppet] - https://gerrit.wikimedia.org/r/1255764
[15:52:16] <wikibugs>	 (CR) Phuedx: [C:+1] testKitchen: Add custom stream name [mediawiki-config] - https://gerrit.wikimedia.org/r/1255763 (https://phabricator.wikimedia.org/T417050) (owner: Milimetric)
[15:53:10] <wikibugs>	 (PS1) Jdlrobson: Implement addListener fallback for older browsers in matchMedia [extensions/WP25EasterEggs] (wmf/1.46.0-wmf.20) - https://gerrit.wikimedia.org/r/1255765 (https://phabricator.wikimedia.org/T419717)
[15:53:22] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha-proxy4003.wikimedia.org with reason: host reimage
[15:54:03] <wikibugs>	 (CR) Ssingh: [V:+1] "By quickly I mean not urgently but that it is a quick review 😊" [puppet] - https://gerrit.wikimedia.org/r/1248385 (https://phabricator.wikimedia.org/T413740) (owner: Muehlenhoff)
[15:56:30] <icinga-wm>	 PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[15:57:56] <wikibugs>	 (CR) Dzahn: [C:+2] Revert "jenkins: define contint1003 as the manager_host for the jenkins role" [puppet] - https://gerrit.wikimedia.org/r/1255764 (owner: Dzahn)
[15:59:02] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha-proxy4003.wikimedia.org with reason: host reimage
[15:59:29] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/1248385 (https://phabricator.wikimedia.org/T413740) (owner: Muehlenhoff)
[16:00:05] <jouncebot>	 jhathaway and rzl: How many deployers does it take to do Puppet request window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260319T1600).
[16:00:05] <jouncebot>	 phuedx and Dreamy_Jazz: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[16:00:08] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Make doh4003/doh4004 new wikidough nodes [puppet] - https://gerrit.wikimedia.org/r/1255738 (https://phabricator.wikimedia.org/T418993) (owner: Muehlenhoff)
[16:00:12] <Dreamy_Jazz>	 \o
[16:00:17] <phuedx>	 o/
[16:01:32] <rzl>	 o/ hi, looking
[16:02:30] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host flink-zk2001.codfw.wmnet
[16:03:12] <wikibugs>	 (PS1) Brouberol: kafka-mirrormaker: update base image to include prometheus-jmx-exporter [deployment-charts] - https://gerrit.wikimedia.org/r/1255767 (https://phabricator.wikimedia.org/T417407)
[16:05:01] <logmsgbot>	 !log brouberol@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[16:05:11] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5025.eqsin.wmnet with reason: host reimage
[16:05:16] <wikibugs>	 (CR) RLazarus: [C:+2] mw::maintenance: Remove ExperimentationLab periodic job [puppet] - https://gerrit.wikimedia.org/r/1249932 (https://phabricator.wikimedia.org/T419428) (owner: Phuedx)
[16:05:56] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5026.eqsin.wmnet with reason: host reimage
[16:06:30] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host flink-zk2001.codfw.wmnet
[16:06:31] <logmsgbot>	 !log brouberol@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[16:06:45] <rzl>	 Dreamy_Jazz: for https://gerrit.wikimedia.org/r/1255694, are you able to get a review from someone familiar with the subject matter?
[16:06:59] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1142.eqiad.wmnet
[16:07:08] <logmsgbot>	 !log brouberol@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'.
[16:07:15] <Dreamy_Jazz>	 I can ask someone from my team to give a +1 if needed
[16:07:21] <rzl>	 I can review that this will indeed do *something* every day at midnight, but not that it'll do the right thing :)
[16:07:34] <rzl>	 yeah, that'd be appreciated -- if it takes longer than the puppet window just ping me, happy to still do it
[16:08:16] <logmsgbot>	 !log brouberol@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[16:08:29] <Dreamy_Jazz>	 Pinged my team
[16:08:30] <rzl>	 assume you'd like me to go ahead with https://gerrit.wikimedia.org/r/1255687 in the meantime though?
[16:08:37] <Dreamy_Jazz>	 Yeah, these are independent changes
[16:08:41] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:08:41] <rzl>	 👍
[16:08:49] <wikibugs>	 (CR) RLazarus: [C:+2] mw::maintenance: Purge blocks on closed but not preinstall wikis [puppet] - https://gerrit.wikimedia.org/r/1255687 (https://phabricator.wikimedia.org/T420571) (owner: Dreamy Jazz)
[16:09:18] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5025.eqsin.wmnet with reason: host reimage
[16:09:23] <wikibugs>	 (CR) Kosta Harlan: [C:+1] mw::maintenance: Run purgeRecentChanges.php on wikis without CheckUser [puppet] - https://gerrit.wikimedia.org/r/1255694 (https://phabricator.wikimedia.org/T420062) (owner: Dreamy Jazz)
[16:09:50] <rzl>	 that was quick, thanks :) will go ahead
[16:09:54] <Dreamy_Jazz>	 :D
[16:10:03] <Dreamy_Jazz>	 That's what pings on Slack get you :D
[16:10:10] <rzl>	 would that it were always so
[16:10:22] <wikibugs>	 (CR) RLazarus: [C:+2] mw::maintenance: Run purgeRecentChanges.php on wikis without CheckUser [puppet] - https://gerrit.wikimedia.org/r/1255694 (https://phabricator.wikimedia.org/T420062) (owner: Dreamy Jazz)
[16:10:44] <rzl>	 merging these all at once, then we can wait for puppet on the deploy host only a single time
[16:10:45] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host flink-zk2002.codfw.wmnet
[16:10:56] <rzl>	 will you want me to manually kick off a test run, or just wait until they fire naturally?
[16:11:10] <Dreamy_Jazz>	 For mine, can wait till they fire naturally
[16:11:19] <rzl>	 cool, phuedx's is a no-op
[16:11:28] <wikibugs>	 (CR) Brouberol: [C:+2] kafka-mirrormaker: enable JMX metrics collection [deployment-charts] - https://gerrit.wikimedia.org/r/1255755 (https://phabricator.wikimedia.org/T417407) (owner: Brouberol)
[16:11:32] <wikibugs>	 (CR) Brouberol: [C:+2] kafka-mirrormaker: update base image to include prometheus-jmx-exporter [deployment-charts] - https://gerrit.wikimedia.org/r/1255767 (https://phabricator.wikimedia.org/T417407) (owner: Brouberol)
[16:11:36] <rzl>	 (meant to say, thanks phuedx for the cleanup <3 easy to forget those)
[16:11:45] <phuedx>	 No worries!
[16:12:06] <zabe>	 I would suggest removing abstractwiki from the dblists until the addWiki bug is fixed. As seen, our infrastructure does not really support wikis in preinstall without a db.
[16:13:40] <wikibugs>	 (Merged) jenkins-bot: kafka-mirrormaker: enable JMX metrics collection [deployment-charts] - https://gerrit.wikimedia.org/r/1255755 (https://phabricator.wikimedia.org/T417407) (owner: Brouberol)
[16:13:41] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5026.eqsin.wmnet with reason: host reimage
[16:13:42] <wikibugs>	 (Merged) jenkins-bot: kafka-mirrormaker: update base image to include prometheus-jmx-exporter [deployment-charts] - https://gerrit.wikimedia.org/r/1255767 (https://phabricator.wikimedia.org/T417407) (owner: Brouberol)
[16:14:12] <wikibugs>	 (PS1) Jsn.sherman: Remove local configuration routing and loading [extensions/AutoModerator] (wmf/1.46.0-wmf.20) - https://gerrit.wikimedia.org/r/1255772 (https://phabricator.wikimedia.org/T419835)
[16:14:18] <wikibugs>	 (PS1) Jforrester: [abstractwiki] Temporarily disable wgWikiLambdaEnableAbstractMode to see if this means we can create the wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1255773 (https://phabricator.wikimedia.org/T420531)
[16:14:26] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/AutoModerator] (wmf/1.46.0-wmf.20) - https://gerrit.wikimedia.org/r/1255772 (https://phabricator.wikimedia.org/T419835) (owner: Jsn.sherman)
[16:14:38] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host flink-zk2002.codfw.wmnet
[16:14:52] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host flink-zk2003.codfw.wmnet
[16:15:17] <James_F>	 As soon as rzl's puppetting is over I'll try another fix for AW.
[16:15:24] <wikibugs>	 (PS3) BBlack: Fix Wmf-Uniq Server-Timing header format [puppet] - https://gerrit.wikimedia.org/r/1255744 (https://phabricator.wikimedia.org/T420586)
[16:15:38] <rzl>	 James_F: go ahead, I'm still finishing up but nothing that'll conflict
[16:15:46] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1142.eqiad.wmnet
[16:16:05] <James_F>	 Ack.
[16:16:10] <rzl>	 (and see also zabe's comment above, for awareness)
[16:16:13] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha-proxy4003.wikimedia.org with OS bookworm
[16:16:14] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host hcaptcha-proxy4003.wikimedia.org
[16:16:20] <wikibugs>	 (CR) BBlack: Fix Wmf-Uniq Server-Timing header format (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1255744 (https://phabricator.wikimedia.org/T420586) (owner: BBlack)
[16:16:38] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1255773 (https://phabricator.wikimedia.org/T420531) (owner: Jforrester)
[16:17:03] <wikibugs>	 (CR) BBlack: [C:+2] Fix Wmf-Uniq Server-Timing header format [puppet] - https://gerrit.wikimedia.org/r/1255744 (https://phabricator.wikimedia.org/T420586) (owner: BBlack)
[16:17:17] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.netbox.restart-reboot rolling reboot on A:netbox
[16:17:21] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache netbox.discovery.wmnet. on all recursors
[16:17:25] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netbox.discovery.wmnet. on all recursors
[16:17:33] <wikibugs>	 (Merged) jenkins-bot: [abstractwiki] Temporarily disable wgWikiLambdaEnableAbstractMode to see if this means we can create the wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1255773 (https://phabricator.wikimedia.org/T420531) (owner: Jforrester)
[16:17:51] <wikibugs>	 (PS2) Dzahn: ci::jenkins: add firewall rule to allow legacy machines to new jenkins [puppet] - https://gerrit.wikimedia.org/r/1255144 (https://phabricator.wikimedia.org/T418521)
[16:17:54] <logmsgbot>	 !log jforrester@deploy2002 Started scap sync-world: Backport for [[gerrit:1255773|[abstractwiki] Temporarily disable wgWikiLambdaEnableAbstractMode to see if this means we can create the wiki (T420531)]]
[16:17:59] <stashbot>	 T420531: addWiki.php fails with CannotReplaceActiveServiceException for DBLoadBalancerFactory - https://phabricator.wikimedia.org/T420531
[16:18:53] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host flink-zk2003.codfw.wmnet
[16:19:47] <logmsgbot>	 !log jforrester@deploy2002 jforrester: Backport for [[gerrit:1255773|[abstractwiki] Temporarily disable wgWikiLambdaEnableAbstractMode to see if this means we can create the wiki (T420531)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[16:20:06] <logmsgbot>	 !log jforrester@deploy2002 jforrester: Continuing with sync
[16:20:13] <logmsgbot>	 !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp2041*} and A:cp - 3.2 test upgrade ()
[16:20:13] <logmsgbot>	 !log fabfur@cumin1003 END (FAIL) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=1) rolling upgrade of HAProxy on P{cp2041*} and A:cp - 3.2 test upgrade ()
[16:20:27] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host hcaptcha-proxy4004.wikimedia.org
[16:20:29] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[16:20:55] <wikibugs>	 (CR) Muehlenhoff: Add fundraising-data-uploader role user (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1255036 (https://phabricator.wikimedia.org/T416948) (owner: CDanis)
[16:21:36] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache netbox.discovery.wmnet. on all recursors
[16:21:40] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netbox.discovery.wmnet. on all recursors
[16:21:58] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[16:22:02] <wikibugs>	 (CR) Eevans: [C:+2] charts/cassandra-http-gateway: template table configuration for hoarde [deployment-charts] - https://gerrit.wikimedia.org/r/1250650 (https://phabricator.wikimedia.org/T414112) (owner: Eevans)
[16:22:19] <icinga-wm>	 RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 378.75 ms
[16:23:39] <logmsgbot>	 !log eevans@cumin1003 START - Cookbook sre.cassandra.roll-reboot rolling reboot on A:aqs-codfw
[16:23:41] <wikibugs>	 (PS1) DCausse: airflow-search: add secrets for opensearch-semantic-search clusters [deployment-charts] - https://gerrit.wikimedia.org/r/1255778 (https://phabricator.wikimedia.org/T414091)
[16:23:52] <wikibugs>	 (Merged) jenkins-bot: charts/cassandra-http-gateway: template table configuration for hoarde [deployment-charts] - https://gerrit.wikimedia.org/r/1250650 (https://phabricator.wikimedia.org/T414112) (owner: Eevans)
[16:23:59] <wikibugs>	 (PS1) Jforrester: Revert "[abstractwiki] Temporarily disable wgWikiLambdaEnableAbstractMode to see if this means we can create the wiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/1255779
[16:24:00] <logmsgbot>	 !log jforrester@deploy2002 Finished scap sync-world: Backport for [[gerrit:1255773|[abstractwiki] Temporarily disable wgWikiLambdaEnableAbstractMode to see if this means we can create the wiki (T420531)]] (duration: 06m 06s)
[16:24:04] <stashbot>	 T420531: addWiki.php fails with CannotReplaceActiveServiceException for DBLoadBalancerFactory - https://phabricator.wikimedia.org/T420531
[16:24:37] <rzl>	 Dreamy_Jazz: you're all set
[16:24:38] <wikibugs>	 (CR) Jforrester: [C:+2] Revert "[abstractwiki] Temporarily disable wgWikiLambdaEnableAbstractMode to see if this means we can create the wiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/1255779 (owner: Jforrester)
[16:24:39] <wikibugs>	 (PS3) CDanis: Add fundraising-data-uploader role user [puppet] - https://gerrit.wikimedia.org/r/1255036 (https://phabricator.wikimedia.org/T416948)
[16:24:42] <rzl>	 puppet window complete \o/
[16:24:46] <Dreamy_Jazz>	 Thanks!
[16:24:56] <wikibugs>	 (CR) CDanis: Add fundraising-data-uploader role user (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1255036 (https://phabricator.wikimedia.org/T416948) (owner: CDanis)
[16:24:58] <logmsgbot>	 jmm@cumin2002 makevm (PID 4108997) is awaiting input
[16:25:25] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[16:25:42] <James_F>	 Hmmmmm.
[16:25:48] <wikibugs>	 (Merged) jenkins-bot: Revert "[abstractwiki] Temporarily disable wgWikiLambdaEnableAbstractMode to see if this means we can create the wiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/1255779 (owner: Jforrester)
[16:25:50] <wikibugs>	 (CR) CI reject: [V:-1] airflow-search: add secrets for opensearch-semantic-search clusters [deployment-charts] - https://gerrit.wikimedia.org/r/1255778 (https://phabricator.wikimedia.org/T414091) (owner: DCausse)
[16:26:01] <James_F>	 How do I stop the helpful code redirecting me to incubator for abstract.wikipedia.org now that it exists?
[16:26:06] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1255779 (owner: Jforrester)
[16:26:21] <logmsgbot>	 !log jforrester@deploy2002 Started scap sync-world: Backport for [[gerrit:1255779|Revert "[abstractwiki] Temporarily disable wgWikiLambdaEnableAbstractMode to see if this means we can create the wiki"]]
[16:26:22] <wikibugs>	 (PS8) Eevans: services: add linked-artifacts service [deployment-charts] - https://gerrit.wikimedia.org/r/1250651 (https://phabricator.wikimedia.org/T414112)
[16:27:07] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service aqs2001-a:9042 has failed probes (tcp_cassandra_a_cql_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:27:22] <mutante>	 James_F: trying to help you out
[16:27:41] <James_F>	 mutante: Is it just a Varnish cache issue?
[16:28:14] <logmsgbot>	 !log jforrester@deploy2002 jforrester: Backport for [[gerrit:1255779|Revert "[abstractwiki] Temporarily disable wgWikiLambdaEnableAbstractMode to see if this means we can create the wiki"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[16:28:16] <mutante>	 well, I tried to exclude that option and ran a purge cache command
[16:28:22] <mutante>	 but maybe it isn't that
[16:28:56] <James_F>	 The MW-side rewrite code is in multiversion/missing.php
[16:29:12] <taavi>	 it's not a cached redirect, I get that when curling mw-web directly
[16:29:16] <bblack>	 yeah
[16:29:21] <bblack>	 < HTTP/2 302 
[16:29:21] <bblack>	 < date: Thu, 19 Mar 2026 16:28:56 GMT
[16:29:21] <bblack>	 < server: mw-web.codfw.main-5947f4dd7b-h86d8
[16:29:21] <bblack>	 < cache-control: no-cache
[16:29:22] <bblack>	 < location: https://incubator.wikimedia.org/wiki/Wp/abstract?goto=mainpage
[16:29:35] <bblack>	 ^ MW is doing the redirect, it was a miss/pass situation in the cache
[16:29:46] <logmsgbot>	 !log jforrester@deploy2002 jforrester: Continuing with sync
[16:30:57] <James_F>	 Oh, duh, it's still pre-installed.
[16:31:02] <taavi>	 James_F: abstractwiki is in preinstall.dblist, and multiversion/MWMultiVersion.php line 741 means everything in it gets redirected to incubator
[16:31:05] <taavi>	 .. yes
[16:31:05] <James_F>	 So it's behaving correctly.
[16:31:07] <James_F>	 Yeah.
[16:31:12] <mutante>	 James_F: how did you get around https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1255514  
[16:31:22] <mutante>	 I mean https://phabricator.wikimedia.org/T420531
[16:31:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[16:32:03] <James_F>	 mutante: Just marked that as Resolved. Amir1 gave the idea of disabling the new extension first, which worked.
[16:32:07] <jinxer-wm>	 RESOLVED: [4x] ProbeDown: Service aqs2001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:32:12] <mutante>	 James_F: cool! ack
[16:32:41] <swfrench-wmf>	 James_F: thank you very much for fixing that!
[16:32:51] <James_F>	 Time to activate the wiki.
[16:33:11] <rzl>	 🎉
[16:33:32] <wikibugs>	 (PS1) Jforrester: Activate Abstract Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1255039
[16:33:40] <logmsgbot>	 !log jforrester@deploy2002 Finished scap sync-world: Backport for [[gerrit:1255779|Revert "[abstractwiki] Temporarily disable wgWikiLambdaEnableAbstractMode to see if this means we can create the wiki"]] (duration: 07m 19s)
[16:33:41] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:33:43] <wikibugs>	 (PS2) Jforrester: Activate Abstract Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1255039
[16:33:54] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/WP25EasterEggs] (wmf/1.46.0-wmf.20) - https://gerrit.wikimedia.org/r/1255765 (https://phabricator.wikimedia.org/T419717) (owner: Jdlrobson)
[16:34:21] <logmsgbot>	 jmm@cumin2002 restart-reboot (PID 4108615) is awaiting input
[16:34:24] <wikibugs>	 (PS3) Jforrester: Activate Abstract Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1255039 (https://phabricator.wikimedia.org/T411723)
[16:34:39] <taavi>	 that commit message ("Activate Abstract Wikipedia") sounds very fancy
[16:34:50] <James_F>	 Doesn't it just?
[16:34:51] <wikibugs>	 (PS4) CDanis: Add fundraising-data-uploader role user [puppet] - https://gerrit.wikimedia.org/r/1255036 (https://phabricator.wikimedia.org/T416948)
[16:34:58] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1255039 (https://phabricator.wikimedia.org/T411723) (owner: Jforrester)
[16:35:14] <wikibugs>	 (CR) CDanis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1255036 (https://phabricator.wikimedia.org/T416948) (owner: CDanis)
[16:35:30] <wikibugs>	 (CR) Andrew Bogott: [C:+1] "This will probably help!" [puppet] - https://gerrit.wikimedia.org/r/1254877 (https://phabricator.wikimedia.org/T418444) (owner: Filippo Giunchedi)
[16:35:32] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.netbox.restart-reboot (exit_code=0) rolling reboot on A:netbox
[16:35:47] <wikibugs>	 (PS3) Dzahn: ci::jenkins: add firewall rule to allow legacy machines to new jenkins [puppet] - https://gerrit.wikimedia.org/r/1255144 (https://phabricator.wikimedia.org/T418521)
[16:35:51] <wikibugs>	 (Merged) jenkins-bot: Activate Abstract Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1255039 (https://phabricator.wikimedia.org/T411723) (owner: Jforrester)
[16:36:02] <Amir1>	 🎉 
[16:36:09] <logmsgbot>	 !log jforrester@deploy2002 Started scap sync-world: Backport for [[gerrit:1255039|Activate Abstract Wikipedia (T411723)]]
[16:36:13] <stashbot>	 T411723: Set up abstract.wikipedia.org as a new wiki - https://phabricator.wikimedia.org/T411723
[16:38:05] <logmsgbot>	 !log jforrester@deploy2002 jforrester: Backport for [[gerrit:1255039|Activate Abstract Wikipedia (T411723)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[16:38:22] <logmsgbot>	 !log jforrester@deploy2002 jforrester: Continuing with sync
[16:38:41] <mutante>	 wikidata says "this is not a wiki" - guess we can update that in a minute 
[16:39:01] <logmsgbot>	 !log jmm@cumin2002 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97)
[16:39:04] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[16:39:07] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1151.eqiad.wmnet
[16:40:01] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job netbox_global in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:40:51] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5025.eqsin.wmnet with OS trixie
[16:41:56] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-master1004.eqiad.wmnet
[16:41:56] <logmsgbot>	 !log brett@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on cp5025.eqsin.wmnet with reason: firmware updates
[16:41:58] <mutante>	 James_F: congratulations. https://lists.wikimedia.org/hyperkitty/list/newprojects@lists.wikimedia.org/thread/62EQX4JXMVNTFY6ROXNF2RH2YWEYN3Q3/
[16:42:06] <James_F>	 Thanks!
[16:42:10] <jinxer-wm>	 FIRING: [8x] ProbeDown: Service aqs2001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:42:18] <logmsgbot>	 !log jforrester@deploy2002 Finished scap sync-world: Backport for [[gerrit:1255039|Activate Abstract Wikipedia (T411723)]] (duration: 06m 09s)
[16:42:22] <jinxer-wm>	 FIRING: [8x] ProbeDown: Service aqs2001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:42:23] <stashbot>	 T411723: Set up abstract.wikipedia.org as a new wiki - https://phabricator.wikimedia.org/T411723
[16:42:48] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy4004.wikimedia.org - jmm@cumin2002"
[16:43:07] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy4004.wikimedia.org - jmm@cumin2002"
[16:43:07] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:43:08] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache hcaptcha-proxy4004.wikimedia.org on all recursors
[16:43:11] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) hcaptcha-proxy4004.wikimedia.org on all recursors
[16:43:41] <jinxer-wm>	 RESOLVED: [3x] JobUnavailable: Reduced availability for job netbox_global in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:43:41] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy4004.wikimedia.org - jmm@cumin2002"
[16:43:46] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy4004.wikimedia.org - jmm@cumin2002"
[16:43:47] <taavi>	 hmm, where's the post-creation work task for abstractwiki?
[16:44:01] <logmsgbot>	 !log fnegri@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1013.eqiad.wmnet,service=s3
[16:44:03] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11728797 (KFrancis) Hi all, I have sent the NDA out for signatures.  I'll confirm when it's complete. Thanks!
[16:44:07] <wikibugs>	 (PS5) CDanis: Add fundraising-data-uploader role user [puppet] - https://gerrit.wikimedia.org/r/1255036 (https://phabricator.wikimedia.org/T416948)
[16:44:11] <wikibugs>	 (CR) CDanis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1255036 (https://phabricator.wikimedia.org/T416948) (owner: CDanis)
[16:44:31] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy4004.wikimedia.org with OS bookworm
[16:44:40] <wikibugs>	 (CR) CI reject: [V:-1] Add fundraising-data-uploader role user [puppet] - https://gerrit.wikimedia.org/r/1255036 (https://phabricator.wikimedia.org/T416948) (owner: CDanis)
[16:45:08] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5026.eqsin.wmnet with OS trixie
[16:45:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: netbox_ganeti_codfw_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:45:50] <wikibugs>	 (PS1) Muehlenhoff: Make hcaptcha-proxy4003/hcaptcha-proxy4004 new hcaptcha-proxy nodes [puppet] - https://gerrit.wikimedia.org/r/1255786 (https://phabricator.wikimedia.org/T418993)
[16:46:12] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1151.eqiad.wmnet
[16:46:45] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[16:46:55] <wikibugs>	 (CR) Ssingh: [C:+1] Make hcaptcha-proxy4003/hcaptcha-proxy4004 new hcaptcha-proxy nodes [puppet] - https://gerrit.wikimedia.org/r/1255786 (https://phabricator.wikimedia.org/T418993) (owner: Muehlenhoff)
[16:47:10] <wikibugs>	 (PS2) DCausse: airflow-search: add secrets for opensearch-semantic-search clusters [deployment-charts] - https://gerrit.wikimedia.org/r/1255778 (https://phabricator.wikimedia.org/T414091)
[16:47:10] <jinxer-wm>	 RESOLVED: [8x] ProbeDown: Service aqs2001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:47:20] <wikibugs>	 (PS6) CDanis: Add fundraising-data-uploader role user [puppet] - https://gerrit.wikimedia.org/r/1255036 (https://phabricator.wikimedia.org/T416948)
[16:47:24] <wikibugs>	 (CR) CDanis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1255036 (https://phabricator.wikimedia.org/T416948) (owner: CDanis)
[16:48:13] <wikibugs>	 (CR) Dzahn: [V:+1 C:+2] "https://puppet-compiler.wmflabs.org/output/1255144/8309/" [puppet] - https://gerrit.wikimedia.org/r/1255144 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn)
[16:48:30] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-master1004.eqiad.wmnet
[16:48:41] <icinga-wm>	 PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 71%, RTA = 4996.32 ms
[16:50:05] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts doh4001.wikimedia.org
[16:51:08] <wikibugs>	 (PS1) Brouberol: kafka-mirrormaker: ensure the right prometheus annotations are set on the pod [deployment-charts] - https://gerrit.wikimedia.org/r/1255792 (https://phabricator.wikimedia.org/T417407)
[16:51:56] <wikibugs>	 (PS1) Jcrespo: mariadb: Deploy grants for backup1 sections for new mediabackup workers [puppet] - https://gerrit.wikimedia.org/r/1255793 (https://phabricator.wikimedia.org/T420464)
[16:52:02] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp5025.eqsin.wmnet
[16:52:10] <jinxer-wm>	 FIRING: [8x] ProbeDown: Service aqs2002-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:53:12] <wikibugs>	 (PS1) Jforrester: [abstractwiki] Allow "Abstract:" as well as "Abstract Wikipedia:" as NS_PROJECT [mediawiki-config] - https://gerrit.wikimedia.org/r/1255794
[16:53:41] <wikibugs>	 ops-eqiad, SRE, DC-Ops, fundraising-tech-ops: Q3:rack/setup/install frdb1008 - https://phabricator.wikimedia.org/T414374#11728833 (VRiley-WMF)
[16:53:41] <wikibugs>	 (CR) Brouberol: [C:+2] kafka-mirrormaker: ensure the right prometheus annotations are set on the pod [deployment-charts] - https://gerrit.wikimedia.org/r/1255792 (https://phabricator.wikimedia.org/T417407) (owner: Brouberol)
[16:53:45] <icinga-wm>	 RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 431.07 ms
[16:54:47] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[16:55:10] <jinxer-wm>	 FIRING: [4x] BFDdown: BFD session down between cr3-ulsfo and 198.35.26.14 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[16:55:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: netbox_ganeti_codfw_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:56:43] <logmsgbot>	 !log brouberol@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/kafka-mirrormaker: apply
[16:57:01] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/1255036 (https://phabricator.wikimedia.org/T416948) (owner: CDanis)
[16:57:09] <logmsgbot>	 !log brouberol@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/kafka-mirrormaker: apply
[16:57:10] <jinxer-wm>	 RESOLVED: [8x] ProbeDown: Service aqs2002-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:57:55] <logmsgbot>	 !log brouberol@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/kafka-mirrormaker: apply
[16:58:19] <logmsgbot>	 !log brouberol@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/kafka-mirrormaker: apply
[16:59:33] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh4001.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[16:59:47] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh4001.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[16:59:48] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:59:49] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh4001.wikimedia.org
[16:59:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts doh4002.wikimedia.org
[17:00:05] <jouncebot>	 bd808: Time to snap out of that daydream and deploy Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260319T1700).
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260319T1700)
[17:00:07] <wikibugs>	 SRE, collaboration-services, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11728845 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `doh4001.wikimedia.org` - doh4001...
[17:00:10] <swfrench-wmf>	 o/
[17:00:22] <swfrench-wmf>	 I'll be doing a bit of testing in mw-debug during this infra window
[17:00:43] <cdanis>	 👀
[17:01:00] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5025.eqsin.wmnet
[17:03:45] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp5025.*
[17:03:55] <wikibugs>	 (PS1) Dzahn: jenkins: pass srange as an array to firewall::service [puppet] - https://gerrit.wikimedia.org/r/1255797 (https://phabricator.wikimedia.org/T418521)
[17:04:39] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[17:05:06] <bd808>	 I have a developer-portal build to ship in my window today.
[17:05:08] <wikibugs>	 (PS1) Brouberol: kafka-mirrormaker: add the mirror_name pod label [deployment-charts] - https://gerrit.wikimedia.org/r/1255799 (https://phabricator.wikimedia.org/T417407)
[17:05:10] <jinxer-wm>	 FIRING: [8x] BFDdown: BFD session down between cr3-ulsfo and 198.35.26.14 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[17:05:13] <logmsgbot>	 !log brett@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on cp5026.eqsin.wmnet with reason: firmware updates
[17:05:27] <wikibugs>	 (CR) Dzahn: [C:+2] jenkins: pass srange as an array to firewall::service [puppet] - https://gerrit.wikimedia.org/r/1255797 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn)
[17:05:40] <wikibugs>	 (PS1) BryanDavis: developer-portal: Bump to 2026-03-19-122408-production [deployment-charts] - https://gerrit.wikimedia.org/r/1255800
[17:06:05] <wikibugs>	 (CR) Jcrespo: [C:+2] mariadb: Deploy grants for backup1 sections for new mediabackup workers [puppet] - https://gerrit.wikimedia.org/r/1255793 (https://phabricator.wikimedia.org/T420464) (owner: Jcrespo)
[17:07:10] <jinxer-wm>	 FIRING: [8x] ProbeDown: Service aqs2003-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:07:22] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[17:07:22] <jinxer-wm>	 FIRING: [8x] ProbeDown: Service aqs2003-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:07:31] <wikibugs>	 (PS1) Ladsgroup: Make the handler follow the thumb steps [extensions/3D] (wmf/1.46.0-wmf.20) - https://gerrit.wikimedia.org/r/1255801 (https://phabricator.wikimedia.org/T414805)
[17:07:34] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[17:08:24] <wikibugs>	 (PS26) Ryan Kemper: dse-k8s: Auto-set OpenSearch pod readahead values [puppet] - https://gerrit.wikimedia.org/r/1254320 (https://phabricator.wikimedia.org/T419041) (owner: Bking)
[17:08:32] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh4002.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[17:08:46] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha-proxy4004.wikimedia.org with reason: host reimage
[17:08:46] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh4002.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[17:08:47] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:08:48] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh4002.wikimedia.org
[17:08:58] <wikibugs>	 SRE, collaboration-services, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11728907 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `doh4002.wikimedia.org` - doh4002...
[17:10:26] <wikibugs>	 ops-codfw, DC-Ops: Unresponsive management for backup2005.mgmt:22 - https://phabricator.wikimedia.org/T420613 (phaultfinder) NEW
[17:10:49] <taavi>	 James_F: I think abstractwiki is missing the usual post-creation work task?
[17:11:04] <James_F>	 If you show me a template I'll make such a task.
[17:11:31] <taavi>	 usually the bot makes those, but T404567
[17:11:31] <stashbot>	 T404567: Post-creation work for tokwiki - https://phabricator.wikimedia.org/T404567
[17:11:39] <James_F>	 There's a bot?
[17:11:47] <wikibugs>	 (PS1) Jcrespo: mediabackups: Pool new worker hosts ms-backup1003 & ms-backup1004 [puppet] - https://gerrit.wikimedia.org/r/1255804 (https://phabricator.wikimedia.org/T420464)
[17:11:49] <wikibugs>	 (PS1) Jcrespo: mediabackups: Pool new worker hosts ms-backup2003 & ms-backup2004 [puppet] - https://gerrit.wikimedia.org/r/1255805 (https://phabricator.wikimedia.org/T420464)
[17:12:03] <taavi>	 yes, one of the things https://phabricator.wikimedia.org/p/Maintenance_bot/ does
[17:12:10] <jinxer-wm>	 RESOLVED: [8x] ProbeDown: Service aqs2003-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:12:12] <wikibugs>	 (PS2) Jcrespo: mediabackups: Pool new worker hosts ms-backup1003 & ms-backup1004 [puppet] - https://gerrit.wikimedia.org/r/1255804 (https://phabricator.wikimedia.org/T420464)
[17:12:15] <James_F>	 Fancy.
[17:12:37] <wikibugs>	 (PS3) Jcrespo: mediabackups: Pool new worker hosts ms-backup1003 & ms-backup1004 [puppet] - https://gerrit.wikimedia.org/r/1255804 (https://phabricator.wikimedia.org/T420464)
[17:12:38] <James_F>	 Surely we're not still doing RESTbase crap?
[17:12:42] <James_F>	 Oy veh.
[17:12:44] <wikibugs>	 (CR) Jcrespo: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1255804 (https://phabricator.wikimedia.org/T420464) (owner: Jcrespo)
[17:12:46] <wikibugs>	 (CR) CI reject: [V:-1] mediabackups: Pool new worker hosts ms-backup2003 & ms-backup2004 [puppet] - https://gerrit.wikimedia.org/r/1255805 (https://phabricator.wikimedia.org/T420464) (owner: Jcrespo)
[17:13:33] <wikibugs>	 (PS2) Jcrespo: mediabackup: Pool new worker hosts ms-backup2003 & ms-backup2004 [puppet] - https://gerrit.wikimedia.org/r/1255805 (https://phabricator.wikimedia.org/T420464)
[17:13:37] <wikibugs>	 (CR) Jcrespo: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1255805 (https://phabricator.wikimedia.org/T420464) (owner: Jcrespo)
[17:14:02] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha-proxy4004.wikimedia.org with reason: host reimage
[17:14:05] <wikibugs>	 (CR) BryanDavis: [C:+2] developer-portal: Bump to 2026-03-19-122408-production [deployment-charts] - https://gerrit.wikimedia.org/r/1255800 (owner: BryanDavis)
[17:15:29] <wikibugs>	 (PS1) Dzahn: Revert^2 "jenkins: define contint1003 as the manager_host for the jenkins role" [puppet] - https://gerrit.wikimedia.org/r/1255808
[17:15:38] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp5026.eqsin.wmnet
[17:15:45] <wikibugs>	 (PS3) Jcrespo: mediabackup: Pool new worker hosts ms-backup2003 & ms-backup2004 [puppet] - https://gerrit.wikimedia.org/r/1255805 (https://phabricator.wikimedia.org/T420464)
[17:15:48] <wikibugs>	 (CR) Jcrespo: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1255805 (https://phabricator.wikimedia.org/T420464) (owner: Jcrespo)
[17:16:09] <wikibugs>	 (Merged) jenkins-bot: developer-portal: Bump to 2026-03-19-122408-production [deployment-charts] - https://gerrit.wikimedia.org/r/1255800 (owner: BryanDavis)
[17:16:54] <wikibugs>	 (CR) Dzahn: [C:+2] Revert^2 "jenkins: define contint1003 as the manager_host for the jenkins role" [puppet] - https://gerrit.wikimedia.org/r/1255808 (owner: Dzahn)
[17:17:55] <jinxer-wm>	 FIRING: [8x] ProbeDown: Service aqs2004-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:18:42] <logmsgbot>	 !log bd808@deploy2002 helmfile [staging] START helmfile.d/services/developer-portal: apply
[17:19:12] <wikibugs>	 (CR) Jcrespo: [C:+2] mediabackups: Pool new worker hosts ms-backup1003 & ms-backup1004 [puppet] - https://gerrit.wikimedia.org/r/1255804 (https://phabricator.wikimedia.org/T420464) (owner: Jcrespo)
[17:19:37] <Amir1>	 jouncebot: nowandnext
[17:19:37] <jouncebot>	 For the next 0 hour(s) and 40 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260319T1700)
[17:19:37] <jouncebot>	 For the next 0 hour(s) and 40 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260319T1700)
[17:19:38] <jouncebot>	 In 0 hour(s) and 40 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260319T1800)
[17:20:21] <wikibugs>	 (CR) Ryan Kemper: [C:+1] "Okay, I cleaned up the commit message; I can't identify any further issues with this patch, and in any case we should get this merged and " [puppet] - https://gerrit.wikimedia.org/r/1254320 (https://phabricator.wikimedia.org/T419041) (owner: Bking)
[17:21:54] <logmsgbot>	 !log bd808@deploy2002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply
[17:22:01] <logmsgbot>	 !log bd808@deploy2002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply
[17:22:20] <logmsgbot>	 !log bd808@deploy2002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply
[17:22:35] <logmsgbot>	 !log bd808@deploy2002 helmfile [codfw] START helmfile.d/services/developer-portal: apply
[17:22:55] <jinxer-wm>	 RESOLVED: [8x] ProbeDown: Service aqs2004-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:22:57] <logmsgbot>	 !log bd808@deploy2002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply
[17:24:30] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5026.eqsin.wmnet
[17:24:38] <wikibugs>	 (CR) Jcrespo: [C:+2] mediabackup: Pool new worker hosts ms-backup2003 & ms-backup2004 [puppet] - https://gerrit.wikimedia.org/r/1255805 (https://phabricator.wikimedia.org/T420464) (owner: Jcrespo)
[17:26:42] <logmsgbot>	 !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on contint1003.wikimedia.org with reason: jenkins on java21
[17:28:10] <jinxer-wm>	 FIRING: [8x] ProbeDown: Service aqs2005-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:28:22] <wikibugs>	 (CR) AOkoth: [C:+2] "Yes, os-reports is listed a few lines above." [deployment-charts] - https://gerrit.wikimedia.org/r/1255514 (https://phabricator.wikimedia.org/T414405) (owner: AOkoth)
[17:29:55] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha-proxy4004.wikimedia.org with OS bookworm
[17:29:55] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host hcaptcha-proxy4004.wikimedia.org
[17:30:36] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp5026.*
[17:32:45] <wikibugs>	 (PS3) Jcrespo: mediabackup: Deploy new 24 shards (6 hosts) for mediabackups@eqiad [puppet] - https://gerrit.wikimedia.org/r/1254999 (https://phabricator.wikimedia.org/T410028)
[17:33:10] <jinxer-wm>	 RESOLVED: [8x] ProbeDown: Service aqs2005-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:34:42] <wikibugs>	 SRE, SRE-swift-storage, Observability-Metrics: thanos swift capacity for FY 26/27 - https://phabricator.wikimedia.org/T419713#11729033 (herron) We chatted about this a bit at the o11y team meeting this week and consensus was that we're looking ok capacity wise, but would like to explore the potential...
[17:36:06] <wikibugs>	 (PS4) Jcrespo: mediabackup: Deploy new 24 shards (6 hosts) for mediabackups@eqiad [puppet] - https://gerrit.wikimedia.org/r/1254999 (https://phabricator.wikimedia.org/T410028)
[17:38:10] <jinxer-wm>	 FIRING: [9x] ProbeDown: Service aqs2005-a:9042 has failed probes (tcp_cassandra_a_cql_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:38:10] <jinxer-wm>	 FIRING: [9x] ProbeDown: Service aqs2005-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:39:11] <wikibugs>	 (03PS5) 10Jcrespo: mediabackup: Deploy new 24 shards (6 hosts) for mediabackups@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1254999 (https://phabricator.wikimedia.org/T410028)
[17:39:43] <wikibugs>	 (03PS1) 10AOkoth: ats: add wmf-navigator entry [puppet] - 10https://gerrit.wikimedia.org/r/1255818 (https://phabricator.wikimedia.org/T414405)
[17:43:10] <jinxer-wm>	 RESOLVED: [8x] ProbeDown: Service aqs2006-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:43:41] <wikibugs>	 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#11729059 (10Krinkle) For posterity, from [Grafana: Swift dashboard (Krinkle copy)](https://grafana-rw.wikimedia.org/d/75a174f3-44b6-4416-a8b8-201ad5a0c09f/swift-krinkle-copy):  {F7315...
[17:44:44] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.hosts.reboot-single for host lvs1020.eqiad.wmnet
[17:44:46] <wikibugs>	 (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1254999 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo)
[17:45:01] <logmsgbot>	 !log cdobbins@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host lvs1020.eqiad.wmnet
[17:46:15] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[17:46:25] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[17:47:35] <icinga-wm>	 PROBLEM - Host logstash2033 is DOWN: PING CRITICAL - Packet loss = 100%
[17:49:35] <icinga-wm>	 RECOVERY - Host logstash2033 is UP: PING OK - Packet loss = 0%, RTA = 30.49 ms
[17:51:34] <wikibugs>	 (03PS1) 10Jforrester: SpecialAbstractContent: Fix hard-coded policy list page namespace [extensions/WikiLambda] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1255820
[17:53:10] <jinxer-wm>	 FIRING: [8x] ProbeDown: Service aqs2007-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:53:49] <icinga-wm>	 PROBLEM - Host logstash2034 is DOWN: PING CRITICAL - Packet loss = 100%
[17:54:35] <icinga-wm>	 RECOVERY - Host logstash2034 is UP: PING OK - Packet loss = 0%, RTA = 30.60 ms
[17:55:53] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.hosts.reboot-single for host lvs1020.eqiad.wmnet
[17:55:58] <swfrench-wmf>	 FYI, I'm done with my testing for this window
[17:56:29] <James_F>	 Is there any way to stop MWMultiVersion "cleverly" merging extension-default config instead of over-writing it?
[17:58:10] <jinxer-wm>	 RESOLVED: [8x] ProbeDown: Service aqs2007-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:58:45] <icinga-wm>	 PROBLEM - Host logstash2036 is DOWN: PING CRITICAL - Packet loss = 100%
[17:59:15] <icinga-wm>	 RECOVERY - Host logstash2036 is UP: PING OK - Packet loss = 0%, RTA = 30.27 ms
[18:00:05] <jouncebot>	 andre and brennen: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260319T1800).
[18:00:12] <andre>	 nah
[18:01:55] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[18:02:36] <logmsgbot>	 !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1020.eqiad.wmnet
[18:03:41] <jinxer-wm>	 RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[18:03:55] <jinxer-wm>	 FIRING: [8x] ProbeDown: Service aqs2008-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:04:45] <wikibugs>	 10SRE-SLO, 13Patch-For-Review: Sloth: onboard existing SLOs to sloth manifests - https://phabricator.wikimedia.org/T418163#11729162 (10herron)
[18:04:49] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netbox: netbox report error for puppetdb serial versus netbox serial - https://phabricator.wikimedia.org/T420623 (10RobH) 03NEW
[18:06:05] <icinga-wm>	 PROBLEM - Host logstash2037 is DOWN: PING CRITICAL - Packet loss = 100%
[18:06:23] <wikibugs>	 10SRE-SLO, 13Patch-For-Review: Sloth: onboard existing SLOs to sloth manifests - https://phabricator.wikimedia.org/T418163#11729180 (10herron)
[18:06:35] <icinga-wm>	 RECOVERY - Host logstash2037 is UP: PING OK - Packet loss = 0%, RTA = 30.50 ms
[18:08:55] <jinxer-wm>	 RESOLVED: [8x] ProbeDown: Service aqs2008-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:10:16] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netbox: netbox report error for puppetdb serial versus netbox serial for backup1012 - https://phabricator.wikimedia.org/T420623#11729203 (10RobH)
[18:12:32] <wikibugs>	 (03PS6) 10Jcrespo: mediabackup: Deploy new 24 shards (6 hosts) for mediabackups@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1254999 (https://phabricator.wikimedia.org/T410028)
[18:12:36] <wikibugs>	 (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1254999 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo)
[18:12:47] <icinga-wm>	 PROBLEM - Host logstash2035 is DOWN: PING CRITICAL - Packet loss = 100%
[18:12:50] <wikibugs>	 10SRE-SLO, 13Patch-For-Review: Sloth: onboard existing SLOs to sloth manifests - https://phabricator.wikimedia.org/T418163#11729209 (10herron) p:05Triage→03Medium
[18:13:46] <wikibugs>	 10SRE-SLO, 13Patch-For-Review: Sloth: onboard existing SLOs to sloth manifests - https://phabricator.wikimedia.org/T418163#11729213 (10herron) 05Open→03Resolved a:03herron The sloth onboarding backlog is empty!
[18:14:15] <icinga-wm>	 RECOVERY - Host logstash2035 is UP: PING OK - Packet loss = 0%, RTA = 30.56 ms
[18:14:37] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service aqs2009-a:9042 has failed probes (tcp_cassandra_a_cql_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:15:43] <wikibugs>	 (03CR) 10Jcrespo: [C:03+2] mediabackup: Deploy new 24 shards (6 hosts) for mediabackups@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1254999 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo)
[18:16:12] <wikibugs>	 (03PS1) 10Jforrester: RepoBooks::onMediaWikiServices: Skip all low NSes, not just NS0 [extensions/WikiLambda] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1255824 (https://phabricator.wikimedia.org/T420617)
[18:16:24] <James_F>	 jouncebot: nowandnext
[18:16:24] <jouncebot>	 For the next 1 hour(s) and 43 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260319T1800)
[18:16:24] <jouncebot>	 In 1 hour(s) and 43 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260319T2000)
[18:16:29] <James_F>	 OK, will deploy.
[18:18:44] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [extensions/WikiLambda] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1255824 (https://phabricator.wikimedia.org/T420617) (owner: 10Jforrester)
[18:18:44] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [extensions/WikiLambda] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1255820 (owner: 10Jforrester)
[18:18:45] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1255794 (owner: 10Jforrester)
[18:19:10] <jinxer-wm>	 RESOLVED: [8x] ProbeDown: Service aqs2009-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:19:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:19:55] <wikibugs>	 (03Merged) 10jenkins-bot: [abstractwiki] Allow "Abstract:" as well as "Abstract Wikipedia:" as NS_PROJECT [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1255794 (owner: 10Jforrester)
[18:22:43] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal
[18:23:04] <wikibugs>	 (03CR) 10CI reject: [V:04-1] RepoBooks::onMediaWikiServices: Skip all low NSes, not just NS0 [extensions/WikiLambda] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1255824 (https://phabricator.wikimedia.org/T420617) (owner: 10Jforrester)
[18:23:13] <sukhe>	 brett: reboot?
[18:23:23] <icinga-wm>	 PROBLEM - pybal on lvs1020 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[18:23:39] <wikibugs>	 (03Merged) 10jenkins-bot: RepoBooks::onMediaWikiServices: Skip all low NSes, not just NS0 [extensions/WikiLambda] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1255824 (https://phabricator.wikimedia.org/T420617) (owner: 10Jforrester)
[18:24:02] <brett>	 sukhe: Yeah, but it shouldn't be lvs1020 - cjd91 did you do lvs1020?
[18:24:09] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [extensions/WikiLambda] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1255820 (owner: 10Jforrester)
[18:24:19] <wikibugs>	 (03Merged) 10jenkins-bot: SpecialAbstractContent: Fix hard-coded policy list page namespace [extensions/WikiLambda] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1255820 (owner: 10Jforrester)
[18:24:24] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs1020 is CRITICAL: CRITICAL: 0 connections established with conf1007.eqiad.wmnet:4001 (min=112) https://wikitech.wikimedia.org/wiki/PyBal
[18:24:31] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:24:32] <cjd91>	 yeah. sorry, I'll fix it
[18:24:39] <logmsgbot>	 !log jforrester@deploy2002 Started scap sync-world: Backport for [[gerrit:1255824|RepoBooks::onMediaWikiServices: Skip all low NSes, not just NS0 (T420617)]], [[gerrit:1255820|SpecialAbstractContent: Fix hard-coded policy list page namespace]], [[gerrit:1255794|[abstractwiki] Allow "Abstract:" as well as "Abstract Wikipedia:" as NS_PROJECT]]
[18:24:44] <stashbot>	 T420617: RecentChanges on Abstract Wikipedia links to users are wrong - https://phabricator.wikimedia.org/T420617
[18:25:24] <icinga-wm>	 RECOVERY - pybal on lvs1020 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[18:25:39] <wikibugs>	 (03PS1) 10Jcrespo: mediabackup: Apply final role for eqiad mediabackup new storages [puppet] - 10https://gerrit.wikimedia.org/r/1255828 (https://phabricator.wikimedia.org/T420506)
[18:25:42] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[18:25:55] <jinxer-wm>	 FIRING: [3x] ProbeDown: Service aqs2010-b:7000 has failed probes (tcp_cassandra_b_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:25:59] <wikibugs>	 (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1255828 (https://phabricator.wikimedia.org/T420506) (owner: 10Jcrespo)
[18:26:38] <logmsgbot>	 !log jforrester@deploy2002 jforrester: Backport for [[gerrit:1255824|RepoBooks::onMediaWikiServices: Skip all low NSes, not just NS0 (T420617)]], [[gerrit:1255820|SpecialAbstractContent: Fix hard-coded policy list page namespace]], [[gerrit:1255794|[abstractwiki] Allow "Abstract:" as well as "Abstract Wikipedia:" as NS_PROJECT]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[18:27:02] <logmsgbot>	 !log jforrester@deploy2002 jforrester: Continuing with sync
[18:29:10] <jinxer-wm>	 FIRING: [8x] ProbeDown: Service aqs2010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:29:15] <wikibugs>	 (03PS2) 10Jcrespo: mediabackup: Apply final role for eqiad mediabackup new storages [puppet] - 10https://gerrit.wikimedia.org/r/1255828 (https://phabricator.wikimedia.org/T420506)
[18:29:24] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs1020 is OK: OK: 112 connections established with conf1007.eqiad.wmnet:4001 (min=112) https://wikitech.wikimedia.org/wiki/PyBal
[18:29:51] <wikibugs>	 (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1255828 (https://phabricator.wikimedia.org/T420506) (owner: 10Jcrespo)
[18:30:55] <jinxer-wm>	 FIRING: [8x] ProbeDown: Service aqs2010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:30:59] <logmsgbot>	 !log jforrester@deploy2002 Finished scap sync-world: Backport for [[gerrit:1255824|RepoBooks::onMediaWikiServices: Skip all low NSes, not just NS0 (T420617)]], [[gerrit:1255820|SpecialAbstractContent: Fix hard-coded policy list page namespace]], [[gerrit:1255794|[abstractwiki] Allow "Abstract:" as well as "Abstract Wikipedia:" as NS_PROJECT]] (duration: 06m 20s)
[18:31:04] <stashbot>	 T420617: RecentChanges on Abstract Wikipedia links to users are wrong - https://phabricator.wikimedia.org/T420617
[18:32:16] <icinga-wm>	 PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs1019 is CRITICAL: CRITICAL: Service pybal.service is not active. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[18:32:26] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal
[18:32:32] <icinga-wm>	 PROBLEM - pybal on lvs1019 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[18:32:43] <brett>	 ^known
[18:32:52] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs1019 is CRITICAL: CRITICAL: 0 connections established with conf1007.eqiad.wmnet:4001 (min=82) https://wikitech.wikimedia.org/wiki/PyBal
[18:34:10] <jinxer-wm>	 RESOLVED: [8x] ProbeDown: Service aqs2010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:34:36] <wikibugs>	 (03CR) 10Jcrespo: [C:03+2] mediabackup: Apply final role for eqiad mediabackup new storages [puppet] - 10https://gerrit.wikimedia.org/r/1255828 (https://phabricator.wikimedia.org/T420506) (owner: 10Jcrespo)
[18:35:47] <brett>	 cjd91: Did you run the downtime cookbook?
[18:38:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:39:55] <topranks>	 brett: everything ok with the lvs?  I was gonna make some additions to vlans in eqiad row D, won't affect that but if there is some connectivity problem or incident I'll hold off just in case
[18:40:10] <jinxer-wm>	 FIRING: [8x] ProbeDown: Service aqs2011-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:40:30] <jynus>	 also holding further deployments just in case
[18:41:36] <brett>	 topranks: There's no incident, no - there's reboots going on
[18:41:46] <brett>	 cjd91, can you run the downtime cookbook?
[18:41:49] <topranks>	 ok cool I'll proceed in that case 
[18:41:57] <topranks>	 brett: just being cautious thanks! 
[18:42:01] <brett>	 thanks for checking!
[18:42:07] <logmsgbot>	 !log eevans@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-reboot (exit_code=0) rolling reboot on A:aqs-codfw
[18:42:09] <jynus>	 I will wait a bit for alerts to clear, I need visibility in case I myself throw errors
[18:43:31] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:44:12] <logmsgbot>	 !log cdobbins@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on lvs1019.eqiad.wmnet with reason: planned reboot
[18:44:20] <brett>	 nice, thanks!
[18:44:41] <cjd91>	 sorry about the delay
[18:45:10] <jinxer-wm>	 RESOLVED: [8x] ProbeDown: Service aqs2011-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:46:56] <wikibugs>	 10ops-eqiad, 06DC-Ops: Inbound errors on interface ssw1-e1-eqiad:xe-0/0/32 (Transport: lvs1020:enp94s0f0np0 (Equinix, 21996479) {#21989994}) - https://phabricator.wikimedia.org/T420634 (10phaultfinder) 03NEW
[18:49:08] <topranks>	 !log add vlan sub-interface for analytics1-d-eqiad vlan to leaf switches in eqiad row d T405562 
[18:50:00] <logmsgbot>	 !log eevans@cumin1003 START - Cookbook sre.cassandra.roll-reboot rolling reboot on A:restbase-codfw
[18:50:29] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netbox: netbox report error for puppetdb serial versus netbox serial for backup1012 - https://phabricator.wikimedia.org/T420623#11729465 (10Jclark-ctr) a:03Jclark-ctr
[18:50:55] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.hosts.reboot-single for host lvs1019.eqiad.wmnet
[18:53:03] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27), 07Essential-Work: hw troubleshooting: Comm Error: Backplane 0 on an-worker1172.eqiad.wmnet - https://phabricator.wikimedia.org/T420416#11729490 (10Jclark-ctr) 05Open→03Resolved
[18:53:04] <logmsgbot>	 !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 6 hosts with reason: kernel module reload
[18:54:14] <logmsgbot>	 !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1019.eqiad.wmnet
[18:54:27] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal
[18:54:33] <icinga-wm>	 PROBLEM - pybal on lvs1019 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[18:55:52] <jinxer-wm>	 FIRING: [10x] ProbeDown: Service aqs2012-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:55:53] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.dns.netbox
[18:57:36] <icinga-wm>	 RECOVERY - pybal on lvs1019 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[18:57:52] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs1019 is OK: OK: 82 connections established with conf1007.eqiad.wmnet:4001 (min=82) https://wikitech.wikimedia.org/wiki/PyBal
[19:00:16] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add analytic vlan hostnames - cmooney@cumin1003"
[19:00:20] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add analytic vlan hostnames - cmooney@cumin1003"
[19:00:20] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:00:52] <jinxer-wm>	 RESOLVED: [10x] ProbeDown: Service aqs2012-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:01:42] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to production for Mpostoronca-wmf - https://phabricator.wikimedia.org/T420458#11729624 (10thcipriani) > I need this in order to be able to access live db for query optimization while writing new code  Do you mean for performance metrics and maintenance scripts? O...
[19:01:54] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[19:02:16] <icinga-wm>	 RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs1019 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[19:04:07] <topranks>	 !log disable IPv6 router-advertisements on eqiad core routers for analytics1-d-eqiad vlan T405562
[19:04:22] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[19:06:44] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service restbase2024-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:09:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:09:45] <wikibugs>	 10ops-eqiad, 06DC-Ops: Alert for device ps1-e1-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T420645 (10phaultfinder) 03NEW
[19:11:44] <jinxer-wm>	 RESOLVED: [12x] ProbeDown: Service restbase2024-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:13:05] <wikibugs>	 (03PS1) 10Cathal Mooney: analytics1-d-eqiad vlan: cease sending RAs on CRs and DHCP relay [homer/public] - 10https://gerrit.wikimedia.org/r/1255835 (https://phabricator.wikimedia.org/T405562)
[19:13:21] <wikibugs>	 (03PS6) 10Jcrespo: mediabackup: Deploy new 24 shards (6 hosts) for mediabackups@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1255000 (https://phabricator.wikimedia.org/T410028)
[19:14:31] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:16:31] <jinxer-wm>	 FIRING: ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:17:10] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service restbase2024-b:9042 has failed probes (tcp_cassandra_b_cql_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:19:04] <wikibugs>	 (03PS7) 10Jcrespo: mediabackup: Deploy new 24 shards (6 hosts) for mediabackups@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1255000 (https://phabricator.wikimedia.org/T410028)
[19:21:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:21:44] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service restbase2025-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:22:10] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service restbase2025-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:26:44] <jinxer-wm>	 RESOLVED: [12x] ProbeDown: Service restbase2025-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:28:41] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:30:01] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:30:02] <icinga-wm>	 PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1006 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[19:30:08] <sukhe>	 hmm yeah
[19:30:14] <icinga-wm>	 PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns4004 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[19:30:32] <icinga-wm>	 ACKNOWLEDGEMENT - check if authdns-update was run after a change was merged to operations/dns.git on dns1006 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git Sukhbir Singh gerrit down https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[19:30:33] <icinga-wm>	 ACKNOWLEDGEMENT - check if authdns-update was run after a change was merged to operations/dns.git on dns4004 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git Sukhbir Singh gerrit down https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[19:31:44] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service restbase2026-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:32:55] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service restbase2026-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:33:41] <jinxer-wm>	 RESOLVED: [3x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:34:02] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[19:34:02] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 1/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[19:34:19] <wikibugs>	 (03PS8) 10Jcrespo: mediabackup: Deploy new 24 shards (6 hosts) for mediabackups@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1255000 (https://phabricator.wikimedia.org/T410028)
[19:34:22] <wikibugs>	 (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1255000 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo)
[19:34:45] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] Nokia: Manually configure the MAC address for anycast gateway ints [homer/public] - 10https://gerrit.wikimedia.org/r/1255749 (https://phabricator.wikimedia.org/T371088) (owner: 10Cathal Mooney)
[19:35:02] <icinga-wm>	 RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1006 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[19:35:14] <icinga-wm>	 RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns4004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[19:35:36] <logmsgbot>	 !log brett@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs1018.eqiad.wmnet with reason: reboots
[19:35:58] <wikibugs>	 (CR) Cathal Mooney: [C:+2] "@Arzhel I messed up here, meant to self-merge a different patch.  Let me know what you think here I'm happy to revise." [homer/public] - https://gerrit.wikimedia.org/r/1255749 (https://phabricator.wikimedia.org/T371088) (owner: Cathal Mooney)
[19:36:02] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[19:36:02] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[19:36:31] <wikibugs>	 (Merged) jenkins-bot: Nokia: Manually configure the MAC address for anycast gateway ints [homer/public] - https://gerrit.wikimedia.org/r/1255749 (https://phabricator.wikimedia.org/T371088) (owner: Cathal Mooney)
[19:36:31] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:36:44] <jinxer-wm>	 RESOLVED: [12x] ProbeDown: Service restbase2026-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:36:57] <brett>	 !log stopping pybal/puppet on lvs1018 for reboots
[19:36:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:37:08] <wikibugs>	 (CR) Jcrespo: [C:+1] mediabackups: Open s3 storage port on storage hosts from working hosts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1255003 (https://phabricator.wikimedia.org/T410028) (owner: Jcrespo)
[19:37:39] <jinxer-wm>	 FIRING: CoreBGPDown: Core BGP session down between cr2-esams and cr1-eqiad (185.15.59.144) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=esams&var-device=cr2-esams:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr1-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[19:38:05] <wikibugs>	 (CR) Cathal Mooney: [C:+2] analytics1-d-eqiad vlan: cease sending RAs on CRs and DHCP relay [homer/public] - https://gerrit.wikimedia.org/r/1255835 (https://phabricator.wikimedia.org/T405562) (owner: Cathal Mooney)
[19:38:29] <wikibugs>	 (PS1) Catrope: testwiki: Add temporary groups for security testing [mediawiki-config] - https://gerrit.wikimedia.org/r/1255847
[19:38:36] <wikibugs>	 (CR) Jcrespo: [C:+2] mediabackup: Deploy new 24 shards (6 hosts) for mediabackups@codfw [puppet] - https://gerrit.wikimedia.org/r/1255000 (https://phabricator.wikimedia.org/T410028) (owner: Jcrespo)
[19:39:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:41:07] <wikibugs>	 (Merged) jenkins-bot: analytics1-d-eqiad vlan: cease sending RAs on CRs and DHCP relay [homer/public] - https://gerrit.wikimedia.org/r/1255835 (https://phabricator.wikimedia.org/T405562) (owner: Cathal Mooney)
[19:41:44] <jinxer-wm>	 FIRING: [13x] ProbeDown: Service restbase2026-a:9042 has failed probes (tcp_cassandra_a_cql_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:42:50] <jinxer-wm>	 RESOLVED: CoreBGPDown: Core BGP session down between cr2-esams and cr1-eqiad (185.15.59.144) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=esams&var-device=cr2-esams:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr1-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[19:44:31] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:44:48] <topranks>	 !log disable IPv6 VRRP for et-1/0/5.1023 sub-interfaces on eqiad core routers T405562
[19:44:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:44:53] <stashbot>	 T405562: Eqiad C/D refresh: move legacy switch uplinks to Nokias and migrate Vlan GWs - https://phabricator.wikimedia.org/T405562
[19:45:51] <wikibugs>	 (PS8) Jcrespo: mediabackups: Open s3 storage port on storage hosts from working hosts [puppet] - https://gerrit.wikimedia.org/r/1255003 (https://phabricator.wikimedia.org/T410028)
[19:46:40] <icinga-wm>	 PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2006 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[19:46:42] <icinga-wm>	 PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns6001 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[19:46:44] <jinxer-wm>	 RESOLVED: [12x] ProbeDown: Service restbase2027-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:47:26] <wikibugs>	 (CR) CI reject: [V:-1] mediabackups: Open s3 storage port on storage hosts from working hosts [puppet] - https://gerrit.wikimedia.org/r/1255003 (https://phabricator.wikimedia.org/T410028) (owner: Jcrespo)
[19:48:27] <wikibugs>	 (CR) Jcrespo: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1255003 (https://phabricator.wikimedia.org/T410028) (owner: Jcrespo)
[19:49:00] <icinga-wm>	 PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2004 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[19:51:13] <mutante>	 ^ due to gerrit restart.. but should resolve
[19:51:23] <logmsgbot>	 !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 7 hosts with reason: kernel module reload
[19:52:36] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.dns.netbox
[19:53:44] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.dns.wipe-cache 4.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.8.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa. on all recursors
[19:53:47] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 4.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.8.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa. on all recursors
[19:54:01] <icinga-wm>	 RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[19:54:29] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service restbase2028-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:54:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:55:18] <logmsgbot>	 !log cmooney@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[19:56:24] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host lvs1018.eqiad.wmnet
[19:56:30] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.dns.netbox
[19:56:39] <icinga-wm>	 RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2006 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[19:56:43] <icinga-wm>	 RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns6001 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[19:59:29] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service restbase2028-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:59:30] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1018.eqiad.wmnet
[19:59:31] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:59:35] <wikibugs>	 (CR) Jcrespo: [C:+1] "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1255003 (https://phabricator.wikimedia.org/T410028) (owner: Jcrespo)
[19:59:46] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1018 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260319T2000).
[20:00:06] <jouncebot>	 arlolra, katherine_g, hector-arroyo, JSherman, and jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:14] <icinga-wm>	 PROBLEM - pybal on lvs1018 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[20:00:40] <arlolra>	 o/
[20:00:46] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1018 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[20:00:48] <JSherman>	 o/
[20:01:08] <katherine_g>	 o/
[20:01:14] <icinga-wm>	 RECOVERY - pybal on lvs1018 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[20:01:15] <jynus>	 hey, there is some instability on gerrit
[20:01:18] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add analytic vlan hostnames - cmooney@cumin1003"
[20:01:23] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add analytic vlan hostnames - cmooney@cumin1003"
[20:01:23] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:01:41] <jynus>	 repos may provide errors at the moment
[20:01:44] <jinxer-wm>	 RESOLVED: [12x] ProbeDown: Service restbase2028-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:02:12] <jynus>	 if there is enough time, you may want to wait a bit for deployment
[20:02:26] <brett>	 lvs errors are fine to be ignored, icinga is removing downtimes after reboots when it shouldn
[20:02:28] <brett>	 shouldn't
[20:03:20] <jynus>	 no, gerrit errors
[20:03:24] <rzl>	 (for clarity that's two separate things -- LVS errors are ignorable, gerrit recovery is still in progress)
[20:03:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:03:35] <brett>	 er, yeah, thanks r :)
[20:03:47] <katherine_g>	 ok I can hold off
[20:04:00] <arlolra>	 I can wait
[20:06:44] <jinxer-wm>	 FIRING: [9x] ProbeDown: Service restbase2029-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:08:31] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:09:55] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service restbase2029-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:10:17] <wikibugs>	 (PS2) Catrope: testwiki: Add temporary groups for security testing [mediawiki-config] - https://gerrit.wikimedia.org/r/1255847
[20:11:41] <JSherman>	 rzl: will there be a clear go ahead signal when we're good to go?
[20:11:51] <logmsgbot>	 !log cdobbins@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on lvs1016.eqiad.wmnet with reason: reboot
[20:12:07] <jynus>	 JSherman: it seems things are better now, but waiting for some time to confirm it is ok
[20:14:55] <jinxer-wm>	 RESOLVED: [12x] ProbeDown: Service restbase2029-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:15:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:18:23] <cjd91>	 I've disabled pybal on lvs1016. it shouldn't take longer than 15-20 minutes
[20:19:35] <JSherman>	 katherine_g: it sounds like we can probably get started then?
[20:19:55] <jinxer-wm>	 FIRING: [11x] ProbeDown: Service restbase2030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:19:58] <katherine_g>	 JSherman: ok getting started then 
[20:20:16] <arlolra>	 um, should I start?
[20:20:45] <katherine_g>	 arlolra: yeah sorry! 
[20:20:46] <JSherman>	 arlolra: not trying to line jump! sorry!
[20:20:53] <arlolra>	 :)
[20:21:13] <arlolra>	 if you're confident, you can deploy both config changes at once
[20:21:15] <arlolra>	 or I can
[20:21:44] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service restbase2030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:22:28] <katherine_g>	 arlolra: I can do both at once if that's ok
[20:22:34] <arlolra>	 thanks
[20:23:17] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kgraessle@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1254865 (https://phabricator.wikimedia.org/T418367) (owner: Kgraessle)
[20:23:17] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kgraessle@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1253654 (https://phabricator.wikimedia.org/T420273) (owner: Arlolra)
[20:24:44] <wikibugs>	 (Merged) jenkins-bot: Deploy Extension:PersonalDashboard to English Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1254865 (https://phabricator.wikimedia.org/T418367) (owner: Kgraessle)
[20:25:22] <wikibugs>	 (Merged) jenkins-bot: Deploy PRV to 13 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1253654 (https://phabricator.wikimedia.org/T420273) (owner: Arlolra)
[20:25:40] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.hosts.reboot-single for host lvs1016.eqiad.wmnet
[20:25:44] <logmsgbot>	 !log kgraessle@deploy2002 Started scap sync-world: Backport for [[gerrit:1254865|Deploy Extension:PersonalDashboard to English Wikipedia (T418367)]], [[gerrit:1253654|Deploy PRV to 13 wikis (T420273)]]
[20:25:52] <stashbot>	 T418367: Deploy Extension:PersonalDashboard to English Wikipedia - https://phabricator.wikimedia.org/T418367
[20:25:52] <stashbot>	 T420273: Parsoid Read Views to deploy ~2026-03-19 - https://phabricator.wikimedia.org/T420273
[20:26:44] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service restbase2030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:27:41] <logmsgbot>	 !log kgraessle@deploy2002 kgraessle, arlolra: Backport for [[gerrit:1254865|Deploy Extension:PersonalDashboard to English Wikipedia (T418367)]], [[gerrit:1253654|Deploy PRV to 13 wikis (T420273)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:27:59] <logmsgbot>	 !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1016.eqiad.wmnet
[20:28:34] <icinga-wm>	 PROBLEM - pybal on lvs1016 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[20:28:46] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal
[20:29:09] <katherine_g>	 arlolra: synced to test servers 
[20:29:18] <arlolra>	 looking
[20:29:48] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[20:29:55] <jinxer-wm>	 RESOLVED: [12x] ProbeDown: Service restbase2030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:30:08] <arlolra>	 katherine_g: lgtm
[20:30:34] <icinga-wm>	 RECOVERY - pybal on lvs1016 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[20:32:40] <logmsgbot>	 !log kgraessle@deploy2002 kgraessle, arlolra: Continuing with sync
[20:34:55] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service restbase2031-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:36:44] <logmsgbot>	 !log kgraessle@deploy2002 Finished scap sync-world: Backport for [[gerrit:1254865|Deploy Extension:PersonalDashboard to English Wikipedia (T418367)]], [[gerrit:1253654|Deploy PRV to 13 wikis (T420273)]] (duration: 11m 00s)
[20:36:51] <stashbot>	 T418367: Deploy Extension:PersonalDashboard to English Wikipedia - https://phabricator.wikimedia.org/T418367
[20:36:51] <stashbot>	 T420273: Parsoid Read Views to deploy ~2026-03-19 - https://phabricator.wikimedia.org/T420273
[20:37:18] <katherine_g>	 hector-arroyo: we're done, over to you
[20:37:50] <arlolra>	 katherine_g: thank you
[20:37:58] <katherine_g>	 arlolra: np
[20:39:55] <jinxer-wm>	 RESOLVED: [12x] ProbeDown: Service restbase2031-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:40:11] <hector-arroyo>	 when I try to deploy the change clicking on the spiderpig link I get an access denied error, I think I will need help with this
[20:43:41] <JSherman>	 hector-arroyo: I'll have a look
[20:44:07] <arlolra>	 is it because of gerrit instability?
[20:44:25] <JSherman>	 gerrit is being slow again
[20:44:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:44:40] <JSherman>	 ^ there it is
[20:44:53] <hector-arroyo>	 thanks
[20:45:07] <JSherman>	 so, I think we're stuck
[20:46:44] <jinxer-wm>	 FIRING: [9x] ProbeDown: Service restbase2032-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:46:56] <JSherman>	 with ~15 minutes left in the window, I don't think we're getting any more backports out
[20:47:57] <jynus>	 sorry for the gerrit issues
[20:48:33] <jynus>	 people are still working on it
[20:48:36] <JSherman>	 jynus: I know everybody is doing their best!
[20:49:55] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service restbase2032-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:52:04] <hector-arroyo>	 my change is to test something in testwiki, it's not a big deal if it is deployed next week
[20:54:55] <jinxer-wm>	 RESOLVED: [12x] ProbeDown: Service restbase2032-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:55:01] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job gerrit in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:58:41] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job gerrit in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:59:31] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:00:05] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260319T2100)
[21:00:55] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service restbase2033-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:03:37] <mutante>	 jouncebot: nowandnext
[21:03:37] <jouncebot>	 For the next 0 hour(s) and 56 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260319T2100)
[21:03:37] <jouncebot>	 In 8 hour(s) and 56 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260320T0600)
[21:04:30] <mutante>	 hector-arroyo: it could be done now
[21:04:37] <mutante>	 because the following window is empty
[21:04:50] <mutante>	 gerrit should be doing better
[21:05:25] <jinxer-wm>	 FIRING: [8x] BFDdown: BFD session down between cr3-ulsfo and 198.35.26.14 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[21:05:55] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service restbase2033-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:06:15] <Jdlrobson>	 Hey all is it good to deploy? 
[21:06:22] <Jdlrobson>	 mutante: I need the Web Team deployment window for a few things.
[21:06:35] <Jdlrobson>	 I can do other deploys if there are outstanding ones from the backport window.
[21:06:44] <jinxer-wm>	 RESOLVED: [12x] ProbeDown: Service restbase2033-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:06:44] <James_F>	 Jdlrobson: Gerrit is down-ish, so no deploys right now I think.
[21:07:09] <James_F>	 Oh, I defer to mutante then.
[21:07:13] <mutante>	 well.. Gerrit should be better now.
[21:07:18] <mutante>	 but that window looked empty
[21:07:29] <mutante>	 and the deployers before missed their window due to the gerrit issue
[21:07:30] <Jdlrobson>	 gotcha. Ok down-ish was the bit I was missing. Do we know when it might be back by? We have a bad bug impacting editors that would be best not to leave over the weekend.
[21:07:48] <Jdlrobson>	 I don't mind doing extra deployments once we're stable
[21:08:19] <mutante>	 Jdlrobson: hmm. do it!
[21:08:36] <Jdlrobson>	 mutante: so we're good with Gerrit? What needs deploying? 
[21:09:09] <Jdlrobson>	 I see katherine_g: arlolra  here but none of the other people with changes to deploy.
[21:09:27] <mutante>	 Jdlrobson: Gerrit should be ok again. 2 people were here but then left by now
[21:09:40] <mutante>	 Jdlrobson: you can do your own change
[21:09:41] <Jdlrobson>	 ok I'll start with my user bug if that's okay?
[21:09:49] <mutante>	 yea
[21:10:26] <katherine_g>	 jdlrobson: mine and arlolra's changes were already deployed so we're done
[21:10:42] <logmsgbot>	 !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 11 hosts with reason: kernel module reload
[21:11:03] <Jdlrobson>	 katherine_g: thanks for confirming!
[21:11:06] <Jdlrobson>	 I can ping jason
[21:11:14] <logmsgbot>	 !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on backup2020.codfw.wmnet with reason: kernel module reload
[21:15:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:16:10] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service restbase2034-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:21:10] <jinxer-wm>	 RESOLVED: [12x] ProbeDown: Service restbase2034-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:21:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.93% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[21:22:24] <logmsgbot>	 !log jdlrobson@deploy2002 Started scap sync-world: Backport for [[gerrit:1255881|Skins: Address issue with blurry images for large thumbnails (T375981)]]
[21:22:29] <stashbot>	 T375981: Preferences settings for small image size are not being respected for Parsoid Read Views - https://phabricator.wikimedia.org/T375981
[21:22:49] <wfan>	 Hey cstone, thanks for the review. Do you mind taking a look at smashpig first? Then I can do a version update for di and civi
[21:24:17] <logmsgbot>	 !log jdlrobson@deploy2002 jdlrobson: Backport for [[gerrit:1255881|Skins: Address issue with blurry images for large thumbnails (T375981)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:25:32] <logmsgbot>	 !log jdlrobson@deploy2002 jdlrobson: Continuing with sync
[21:26:10] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service restbase2035-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:26:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.93% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[21:29:27] <logmsgbot>	 !log jdlrobson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1255881|Skins: Address issue with blurry images for large thumbnails (T375981)]] (duration: 07m 03s)
[21:29:32] <stashbot>	 T375981: Preferences settings for small image size are not being respected for Parsoid Read Views - https://phabricator.wikimedia.org/T375981
[21:31:10] <jinxer-wm>	 RESOLVED: [12x] ProbeDown: Service restbase2035-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:33:50] <James_F>	 Jdlrobson: Once you're done, I have a deploy.
[21:34:06] <Jdlrobson>	 James_F: sounds good. Just this one. Hopefully won't take long
[21:34:10] <James_F>	 Sure, no worries.
[21:40:12] <swfrench-wmf>	 James_F: when you're done, could you ping me? I have a change I'd like to get (does not require scap, just some helmfile'ing on mw-web)
[21:40:16] <James_F>	 Of course.
[21:41:10] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service restbase2036-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:46:10] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service restbase2036-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:46:44] <jinxer-wm>	 RESOLVED: [12x] ProbeDown: Service restbase2036-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:48:03] <logmsgbot>	 !log jdlrobson@deploy2002 Started scap sync-world: Backport for [[gerrit:1255765|Implement addListener fallback for older browsers in matchMedia (T419717)]]
[21:48:08] <stashbot>	 T419717: TypeError: mq.addEventListener is not a function. (In 'mq.addEventListener('change',listener)', 'mq.addEventListener' is undefined) - https://phabricator.wikimedia.org/T419717
[21:48:21] <icinga-wm>	 PROBLEM - Host logging-hd1001 is DOWN: PING CRITICAL - Packet loss = 100%
[21:49:49] <icinga-wm>	 RECOVERY - Host logging-hd1001 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms
[21:49:53] <logmsgbot>	 !log jdlrobson@deploy2002 jdlrobson: Backport for [[gerrit:1255765|Implement addListener fallback for older browsers in matchMedia (T419717)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:51:26] <logmsgbot>	 !log jdlrobson@deploy2002 jdlrobson: Continuing with sync
[21:51:42] <Jdlrobson>	 James_F: syncing now. All yours when done
[21:52:55] <James_F>	 <3
[21:54:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.27% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[21:55:20] <logmsgbot>	 !log jdlrobson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1255765|Implement addListener fallback for older browsers in matchMedia (T419717)]] (duration: 07m 17s)
[21:55:31] <stashbot>	 T419717: TypeError: mq.addEventListener is not a function. (In 'mq.addEventListener('change',listener)', 'mq.addEventListener' is undefined) - https://phabricator.wikimedia.org/T419717
[21:56:09] <logmsgbot>	 !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp5024.eqsin.wmnet [reason: trixie reimaging]
[21:56:10] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service restbase2037-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:56:58] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp5024.eqsin.wmnet with OS trixie
[21:57:22] <logmsgbot>	 !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp5019.eqsin.wmnet [reason: trixie reimaging]
[21:57:36] <logmsgbot>	 !log eevans@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-reboot (exit_code=0) rolling reboot on A:restbase-codfw
[21:58:02] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp5019.eqsin.wmnet with OS trixie
[22:01:10] <jinxer-wm>	 RESOLVED: [12x] ProbeDown: Service restbase2037-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:01:29] <logmsgbot>	 !log jforrester@deploy2002 Started scap sync-world: Backport for [[gerrit:1255886|Set WikiLambdaAbstractNamespaces's merge_strategy to provide_default (T420649)]]
[22:01:34] <stashbot>	 T420649: When publishing an Abstract Wikipedia article, it is stored in the wrong Namespace - https://phabricator.wikimedia.org/T420649
[22:03:20] <logmsgbot>	 !log jforrester@deploy2002 jforrester: Backport for [[gerrit:1255886|Set WikiLambdaAbstractNamespaces's merge_strategy to provide_default (T420649)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:04:22] <logmsgbot>	 !log jforrester@deploy2002 jforrester: Continuing with sync
[22:06:56] <jinxer-wm>	 FIRING: MaxConntrack: Elevated conntrack usage on ganeti3006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[22:07:25] <James_F>	 swfrench-wmf: Over to you once this sync completes
[22:07:37] <swfrench-wmf>	 James_F: awesome, thank you!
[22:08:15] <logmsgbot>	 !log jforrester@deploy2002 Finished scap sync-world: Backport for [[gerrit:1255886|Set WikiLambdaAbstractNamespaces's merge_strategy to provide_default (T420649)]] (duration: 06m 46s)
[22:08:20] <stashbot>	 T420649: When publishing an Abstract Wikipedia article, it is stored in the wrong Namespace - https://phabricator.wikimedia.org/T420649
[22:12:53] <swfrench-wmf>	 FYI, I'll be deploying a change to mw-web shortly
[22:12:58] <swfrench-wmf>	 I'll follow up here when done
[22:16:26] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[22:17:44] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[22:18:19] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[22:19:43] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[22:21:56] <jinxer-wm>	 RESOLVED: MaxConntrack: Elevated conntrack usage on ganeti3006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[22:22:05] <swfrench-wmf>	 I am done
[22:23:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:23:56] <jinxer-wm>	 FIRING: MaxConntrack: Elevated conntrack usage on ganeti3006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[22:24:31] <icinga-wm>	 PROBLEM - Host logging-hd1002 is DOWN: PING CRITICAL - Packet loss = 100%
[22:27:01] <icinga-wm>	 RECOVERY - Host logging-hd1002 is UP: PING OK - Packet loss = 0%, RTA = 0.46 ms
[22:28:31] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:37:16] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-web releases routed via main (k8s) 1.512s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[22:42:16] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-web releases routed via main (k8s) 1.235s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[22:42:45] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-web releases routed via main (k8s) 1.132s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[22:48:19] <logmsgbot>	 !log zabe@deploy2002 mwscript-k8s job started: foreachwikiindblist wikidataclient extensions/Wikibase/lib/maintenance/populateSitesTable.php --force-protocol https  # T420643
[22:48:25] <stashbot>	 T420643: Add Wikidata support for abstractwiki - https://phabricator.wikimedia.org/T420643
[22:52:45] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-web releases routed via main (k8s) 828.1ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[22:53:56] <jinxer-wm>	 FIRING: [2x] MaxConntrack: Elevated conntrack usage on ganeti3006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[23:18:23] <logmsgbot>	 !log cdobbins@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp5024.eqsin.wmnet with OS trixie
[23:19:27] <logmsgbot>	 !log cdobbins@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp5019.eqsin.wmnet with OS trixie
[23:19:34] <JJMC89>	 zabe: sorry if that is my fault - was just trying to fill in the post-creation tasks since the bot didn't do it
[23:28:59] <Amir1>	 jouncebot: nowandnext
[23:28:59] <jouncebot>	 No deployments scheduled for the next 6 hour(s) and 31 minute(s)
[23:29:00] <jouncebot>	 In 6 hour(s) and 31 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260320T0600)
[23:33:51] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1255801|Make the handler follow the thumb steps (T414805)]]
[23:33:56] <stashbot>	 T414805: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805
[23:35:44] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1255801|Make the handler follow the thumb steps (T414805)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[23:36:12] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Continuing with sync
[23:40:06] <logmsgbot>	 !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1255801|Make the handler follow the thumb steps (T414805)]] (duration: 06m 14s)
[23:40:10] <stashbot>	 T414805: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805
[23:59:58] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp5019.eqsin.wmnet with OS trixie