[00:00:25] PROBLEM - Host cirrussearch2089 is DOWN: PING CRITICAL - Packet loss = 100%
[00:01:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[00:05:14] !log krinkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1169737|multiversion: Fix "Class Wikimedia\MWConfig\Exception not found"]] (duration: 21m 59s)
[00:08:30] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1170445
[00:08:30] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1170445 (owner: TrainBranchBot)
[00:14:10] (PS13) Krinkle: beta: redirect misc *.beta.wmflabs.org to *.beta.wmcloud.org [puppet] - https://gerrit.wikimedia.org/r/1170188 (https://phabricator.wikimedia.org/T289318)
[00:24:52] (CR) CI reject: [V:-1] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1170445 (owner: TrainBranchBot)
[00:39:10] FIRING: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[00:40:49] (PS1) Kevin Bazira: ml-services: enable multiprocessing for kowiki-damaging [deployment-charts] - https://gerrit.wikimedia.org/r/1170447 (https://phabricator.wikimedia.org/T363336)
[00:44:10] FIRING: [2x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[00:46:39] (CR) TrainBranchBot: [C:+2] "Approved by krinkle@deploy1003 using
scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1170208 (https://phabricator.wikimedia.org/T289318) (owner: Krinkle)
[00:47:33] (Merged) jenkins-bot: beta: Remove routing for *.beta.wmflabs.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1170208 (https://phabricator.wikimedia.org/T289318) (owner: Krinkle)
[00:47:53] !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1170208|beta: Remove routing for *.beta.wmflabs.org (T289318)]]
[00:47:57] T289318: Move *.beta.wmflabs.org to *.beta.wmcloud.org - https://phabricator.wikimedia.org/T289318
[00:49:10] RESOLVED: [2x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[00:49:48] !log krinkle@deploy1003 krinkle: Backport for [[gerrit:1170208|beta: Remove routing for *.beta.wmflabs.org (T289318)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[00:55:40] FIRING: [2x] BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[00:59:55] RESOLVED: [3x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[01:04:52] !log krinkle@deploy1003 krinkle: Continuing with sync
[01:05:55] FIRING: [3x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[01:07:40] FIRING: [4x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[01:10:06] !log krinkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1170208|beta: Remove routing for *.beta.wmflabs.org (T289318)]] (duration: 22m 13s)
[01:10:10] T289318: Move *.beta.wmflabs.org to *.beta.wmcloud.org - https://phabricator.wikimedia.org/T289318
[01:10:55] FIRING: [4x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[01:12:40] RESOLVED: [4x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[01:20:55] FIRING: [4x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status -
https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[01:22:40] RESOLVED: [4x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[01:41:43] (CR) Kevin Bazira: "thank you so much for the merge Luca." [alerts] - https://gerrit.wikimedia.org/r/1170107 (https://phabricator.wikimedia.org/T399683) (owner: Kevin Bazira)
[01:43:27] (CR) BCornwall: [C:+1] "Thanks for cleaning up!" [puppet] - https://gerrit.wikimedia.org/r/1170096 (https://phabricator.wikimedia.org/T394072) (owner: Muehlenhoff)
[01:44:10] FIRING: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[01:49:10] RESOLVED: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[02:01:40] FIRING: [2x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[02:06:40] RESOLVED: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 -
https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[02:18:01] ops-codfw, DC-Ops: Inbound errors on interface cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://phabricator.wikimedia.org/T399916 (phaultfinder) NEW
[02:18:40] FIRING: [2x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[02:23:40] RESOLVED: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[02:33:57] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs.
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:35:55] FIRING: [2x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[02:35:57] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:39:04] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs1022:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[02:40:55] FIRING: [3x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[02:45:55] RESOLVED: [3x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[02:47:40] FIRING: [4x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[02:51:55] FIRING: [5x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[02:54:12] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO -
https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[02:54:25] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-esams:xe-0/1/2 (inter.link reserved port) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[02:56:55] FIRING: [6x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[02:57:40] RESOLVED: [6x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[03:08:57] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:09:25] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[03:09:57] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:10:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-drmrs and Orange (193.251.154.145) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[03:34:08] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-drmrs:xe-0/1/2 (Transit: Orange (LD019029) {#D0072}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down -
https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[03:50:10] FIRING: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[03:54:08] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-drmrs:xe-0/1/2 (Transit: Orange (LD019029) {#D0072}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[03:55:10] FIRING: [2x] BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[04:00:10] RESOLVED: [2x] BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[04:01:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[04:10:40] FIRING: [3x] BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[04:15:40]
RESOLVED: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[04:19:08] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-drmrs:xe-0/1/2 (Transit: Orange (LD019029) {#D0072}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:26:40] FIRING: [3x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[04:31:40] RESOLVED: [2x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[04:34:08] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-drmrs:xe-0/1/2 (Transit: Orange (LD019029) {#D0072}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:40:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr2-drmrs and Orange (193.251.154.145) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[04:59:10] FIRING: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 -
https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[05:04:10] RESOLVED: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[05:05:40] FIRING: [2x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:10:40] RESOLVED: [3x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[05:21:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:31:55] FIRING: [3x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[05:34:53] (CR) Stang: zhwiki: Allow local securepoll setup (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1100228 (https://phabricator.wikimedia.org/T380020) (owner: Stang)
[05:35:05] (PS9) Stang: zhwiki: Allow local
securepoll setup [mediawiki-config] - https://gerrit.wikimedia.org/r/1100228 (https://phabricator.wikimedia.org/T380020)
[05:36:40] RESOLVED: [4x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[05:46:51] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:48:49] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:51:37] (PS1) PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1170455
[05:51:40] FIRING: [7x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[05:52:55] FIRING: [7x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[05:53:42] (PS2) Stevemunene: dns: Add dse-k8s codfw SRV records [dns] - https://gerrit.wikimedia.org/r/1170364 (https://phabricator.wikimedia.org/T397293)
[05:54:37] (CR) Stevemunene: dns: Add dse-k8s codfw SRV records (2 comments) [dns] - https://gerrit.wikimedia.org/r/1170364 (https://phabricator.wikimedia.org/T397293) (owner: Stevemunene)
[05:55:36] (PS3) Ryan Kemper: Replace elasticsearch api with python requests [software/spicerack] - https://gerrit.wikimedia.org/r/1167299 (https://phabricator.wikimedia.org/T390860)
[05:55:58] (CR) Ryan Kemper: Replace elasticsearch api with python requests (2 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/1167299
(https://phabricator.wikimedia.org/T390860) (owner: Ryan Kemper)
[05:56:40] RESOLVED: [5x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[05:59:19] (PS1) PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1170457
[06:00:07] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250718T0600)
[06:02:40] FIRING: [6x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[06:05:03] (CR) CI reject: [V:-1] Replace elasticsearch api with python requests [software/spicerack] - https://gerrit.wikimedia.org/r/1167299 (https://phabricator.wikimedia.org/T390860) (owner: Ryan Kemper)
[06:07:40] FIRING: [6x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[06:09:36] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db1229.eqiad.wmnet with reason: Maintenance
[06:12:40] FIRING: [3x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[06:12:55] RESOLVED: [3x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[06:17:40] FIRING: [5x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status -
https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[06:22:05] !log marostegui@cumin1002 START - Cookbook sre.mysql.parsercache
[06:22:24] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[06:22:40] RESOLVED: [4x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[06:28:07] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance
[06:32:40] FIRING: [5x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[06:37:40] RESOLVED: [4x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[06:39:04] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs1022:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[06:41:35] (CR) Elukey: [C:+2] "Kevin the changes should be propagated via puppet 30/40 mins after the merge, so in our case we should be good."
[alerts] - https://gerrit.wikimedia.org/r/1170107 (https://phabricator.wikimedia.org/T399683) (owner: Kevin Bazira)
[06:47:03] (CR) Elukey: [C:+1] "LGTM but please validate with your team as well :)" [deployment-charts] - https://gerrit.wikimedia.org/r/1170447 (https://phabricator.wikimedia.org/T363336) (owner: Kevin Bazira)
[06:48:38] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host sretest2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[06:48:40] FIRING: [4x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[06:48:55] RESOLVED: [4x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[06:51:22] (CR) Elukey: [C:+2] admin_ng: bump memory quota for kartotherian on Wikikube [deployment-charts] - https://gerrit.wikimedia.org/r/1168840 (owner: Elukey)
[06:51:37] (PS2) Elukey: services: move kartotherian codfw to the maps-test postgres cluster [deployment-charts] - https://gerrit.wikimedia.org/r/1165551 (https://phabricator.wikimedia.org/T381565)
[06:52:45] elukey@cumin1003 provision (PID 2204614) is awaiting input
[06:53:10] RECOVERY - MariaDB Replica Lag: s2 #page on db1229 is OK: OK slave_sql_lag Replication lag: 0.02 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:53:40] FIRING: [5x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[06:54:12] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO -
https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[06:57:51] elukey@cumin1003 provision (PID 2204614) is awaiting input
[06:58:40] RESOLVED: [4x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[06:58:55] FIRING: [5x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250718T0700)
[07:01:43] ops-codfw, DBA, DC-Ops: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927 (Marostegui) NEW
[07:02:30] ops-codfw, DBA, DC-Ops: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927#11015887 (Marostegui) p:Triage→Medium
[07:02:34] (CR) Elukey: [C:+2] services: move kartotherian codfw to the maps-test postgres cluster [deployment-charts] - https://gerrit.wikimedia.org/r/1165551 (https://phabricator.wikimedia.org/T381565) (owner: Elukey)
[07:03:40] RESOLVED: [5x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[07:04:05] !log jelto@cumin1003 START - Cookbook sre.hosts.reboot-single for host gitlab1004.wikimedia.org
[07:06:10] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/kartotherian: sync
[07:09:25] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO -
https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[07:10:01] !log jelto@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab1004.wikimedia.org
[07:10:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1229 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P79353 and previous config saved to /var/cache/conftool/dbconfig/20250718-071014-root.json
[07:10:42] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:11:05] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1233.eqiad.wmnet with reason: Maintenance
[07:11:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1233 (T399249)', diff saved to https://phabricator.wikimedia.org/P79354 and previous config saved to /var/cache/conftool/dbconfig/20250718-071112-marostegui.json
[07:11:17] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[07:13:55] FIRING: [2x] BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[07:14:40] FIRING: [3x] BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[07:15:42] RESOLVED: [3x] JobUnavailable: Reduced availability for job sidekiq in
ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:16:21] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/kartotherian: sync
[07:18:55] RESOLVED: [2x] BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[07:23:21] (PS1) Brouberol: dumpwikibasejson: ensure the dump script exists after any error [dumps] - https://gerrit.wikimedia.org/r/1170459 (https://phabricator.wikimedia.org/T399077)
[07:25:13] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[07:25:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1229 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P79355 and previous config saved to /var/cache/conftool/dbconfig/20250718-072520-root.json
[07:25:56] (PS2) Brouberol: dumpwikibase: ensure the dump script exists after any error [dumps] - https://gerrit.wikimedia.org/r/1170459 (https://phabricator.wikimedia.org/T399077)
[07:27:55] (PS1) Elukey: DNM - test for ML hosts [cookbooks] - https://gerrit.wikimedia.org/r/1170462
[07:28:09] (PS3) Brouberol: dumpwikibase: ensure the dump script exists after any error [dumps] - https://gerrit.wikimedia.org/r/1170459 (https://phabricator.wikimedia.org/T399077)
[07:29:20] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host ml-serve1012.eqiad.wmnet with OS bookworm
[07:29:34] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-serve1012.eqiad.wmnet with OS bookworm
[07:30:30]
!log elukey@deploy1003 helmfile [codfw] START helmfile.d/admin 'sync'.
[07:31:40] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'sync'.
[07:33:13] (PS2) Elukey: DNM - test for ML hosts [cookbooks] - https://gerrit.wikimedia.org/r/1170462
[07:33:41] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host ml-serve1012.eqiad.wmnet with OS bookworm
[07:34:08] (CR) Btullis: [C:+1] dumpwikibase: ensure the dump script exists after any error [dumps] - https://gerrit.wikimedia.org/r/1170459 (https://phabricator.wikimedia.org/T399077) (owner: Brouberol)
[07:34:17] !log elukey@deploy1003 helmfile [eqiad] START helmfile.d/admin 'sync'.
[07:34:31] !log elukey@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'sync'.
[07:34:58] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/kartotherian: sync
[07:35:23] ops-codfw, SRE, DBA, DC-Ops: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927#11015926 (ayounsi) According to https://netbox.wikimedia.org/dcim/devices/?q=es20 es2020 to es2025 are now offline. es2026 to es2034 are almost 5 years old...
[07:37:30] ops-codfw, SRE, DBA, DC-Ops: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927#11015929 (Marostegui) >>! In T399927#11015926, @ayounsi wrote: > According to https://netbox.wikimedia.org/dcim/devices/?q=es20 > es2020 to es2025 are now...
[07:37:41] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927#11015930 (10Marostegui) [07:39:31] (03CR) 10CI reject: [V:04-1] DNM - test for ML hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1170462 (owner: 10Elukey) [07:39:55] (03CR) 10Stevemunene: [C:03+1] "lgtm" [dumps] - 10https://gerrit.wikimedia.org/r/1170459 (https://phabricator.wikimedia.org/T399077) (owner: 10Brouberol) [07:40:10] FIRING: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [07:40:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1229 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P79356 and previous config saved to /var/cache/conftool/dbconfig/20250718-074026-root.json [07:45:02] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ml-serve1012.eqiad.wmnet with OS bookworm [07:45:08] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/kartotherian: sync [07:45:10] FIRING: [2x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [07:46:43] (03PS1) 10Elukey: DNM - test for ML hosts t Change-Id: I8ff264ae5b395b0147d60015599859769ccfb9bd [cookbooks] - 10https://gerrit.wikimedia.org/r/1170463 [07:48:14] (03PS2) 10Elukey: DNM - test for ML hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1170463 [07:49:20] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host ml-serve1012.eqiad.wmnet with OS bookworm [07:49:35] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for 
host ml-serve1012.eqiad.wmnet with OS bookworm [07:50:10] FIRING: [3x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [07:50:36] (03PS3) 10Elukey: DNM - test for ML hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1170463 [07:51:03] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host ml-serve1012.eqiad.wmnet with OS bookworm [07:51:53] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:52:38] (03PS3) 10Arthur taylor: Enable wbui2025 mobile user interface on Wikidata Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170304 (https://phabricator.wikimedia.org/T399703) [07:52:49] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:55:10] FIRING: [5x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [07:55:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1229 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P79357 and previous config saved to /var/cache/conftool/dbconfig/20250718-075532-root.json [07:56:30] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ml-serve1012.eqiad.wmnet with OS bookworm [07:58:03] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#11015965 (10elukey) I have realized that the above DHCP response during UEFI wasn't correct (`/srv/tftpboot/bookworm-installer/pxelinux.0`), and I got why - in the Spic... 
[07:58:36] (03CR) 10CI reject: [V:04-1] DNM - test for ML hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1170463 (owner: 10Elukey) [07:59:44] (03CR) 10Brouberol: [C:03+2] dumpwikibase: ensure the dump script exists after any error [dumps] - 10https://gerrit.wikimedia.org/r/1170459 (https://phabricator.wikimedia.org/T399077) (owner: 10Brouberol) [08:00:10] RESOLVED: [5x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [08:02:00] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [08:07:46] (03PS1) 10Vgutierrez: acme_chief: Delete empty directories after pruning expired certs [puppet] - 10https://gerrit.wikimedia.org/r/1170497 (https://phabricator.wikimedia.org/T399419) [08:09:16] (03PS2) 10Tiziano Fogli: prom/metamonitor: hide DeadManSwitch alerts in Karma [puppet] - 10https://gerrit.wikimedia.org/r/1170360 (https://phabricator.wikimedia.org/T397003) [08:11:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T399249)', diff saved to https://phabricator.wikimedia.org/P79358 and previous config saved to /var/cache/conftool/dbconfig/20250718-081114-marostegui.json [08:11:19] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [08:16:03] (03CR) 10Tiziano Fogli: "Patch ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/1170360 (https://phabricator.wikimedia.org/T397003) (owner: 10Tiziano Fogli) [08:17:09] 06SRE-OnFire, 10Cloud-VPS, 10cloud-services-team (FY2025/26-Q1), 10Sustainability (Incident Followup): Cloud Ceph misbehaving on Debian Bookworm - https://phabricator.wikimedia.org/T399858#11016022 (10fnegri) 05Open→03In progress a:03fnegri Memory usage on cloudcephosd1006 did reset at 18:00 UTC yest... [08:20:40] FIRING: [2x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [08:20:55] RESOLVED: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [08:23:30] (03CR) 10Stevemunene: [C:03+1] "lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1170279 (https://phabricator.wikimedia.org/T399778) (owner: 10Brouberol) [08:24:53] (03CR) 10Brouberol: [C:03+2] site: assign the insetup::data_platform_ferm role to dse-k8s-worker1014 [puppet] - 10https://gerrit.wikimedia.org/r/1170279 (https://phabricator.wikimedia.org/T399778) (owner: 10Brouberol) [08:26:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P79359 and previous config saved to /var/cache/conftool/dbconfig/20250718-082621-marostegui.json [08:32:35] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host sretest2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [08:34:25] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-esams:xe-0/1/2 (inter.link reserved port) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:34:54] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2205 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/1170499 (https://phabricator.wikimedia.org/T399930) [08:35:14] 14SRE-Sprint-Week-Sustainability-March2023, 06Data-Persistence-Automations, 06DBA, 13Patch-For-Review, 10Sustainability (Incident Followup): Implement (or refactor) a script to move slaves when the master is not available - https://phabricator.wikimedia.org/T196366#11016063 (10FCeratto-WMF) [08:35:35] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/kartotherian: sync [08:36:43] (03CR) 10Kevin Bazira: "Great. Thank you for the clarification." 
[alerts] - 10https://gerrit.wikimedia.org/r/1170107 (https://phabricator.wikimedia.org/T399683) (owner: 10Kevin Bazira) [08:37:08] elukey@cumin1003 provision (PID 2216138) is awaiting input [08:37:13] (03PS1) 10Marostegui: db1189: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1170500 (https://phabricator.wikimedia.org/T399548) [08:37:42] (03CR) 10Marostegui: [C:03+2] db1189: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1170500 (https://phabricator.wikimedia.org/T399548) (owner: 10Marostegui) [08:38:01] (03Abandoned) 10Kevin Bazira: team-ml: use global deploy tag for ORESFetchScoreJobKafkaLag alert [alerts] - 10https://gerrit.wikimedia.org/r/1170109 (https://phabricator.wikimedia.org/T399683) (owner: 10Kevin Bazira) [08:38:28] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1189.eqiad.wmnet with reason: Maintenance [08:38:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1189 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P79360 and previous config saved to /var/cache/conftool/dbconfig/20250718-083831-marostegui.json [08:41:07] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [08:41:10] FIRING: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [08:41:30] (03PS1) 10Marostegui: installserver: Do not format es1047 [puppet] - 10https://gerrit.wikimedia.org/r/1170502 [08:41:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P79361 and previous config saved to 
/var/cache/conftool/dbconfig/20250718-084129-marostegui.json [08:42:55] FIRING: [2x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [08:43:09] (03CR) 10Marostegui: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1170502 (owner: 10Marostegui) [08:43:51] (03PS9) 10Elukey: WIP: sre.hosts.provision: add custom settings for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1170085 (https://phabricator.wikimedia.org/T394357) [08:43:51] (03PS4) 10Elukey: DNM - test for ML hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1170463 [08:44:34] (03CR) 10Marostegui: [C:03+2] installserver: Do not format es1047 [puppet] - 10https://gerrit.wikimedia.org/r/1170502 (owner: 10Marostegui) [08:44:44] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host sretest2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [08:45:45] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/kartotherian: sync [08:46:10] RESOLVED: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [08:46:25] (03PS1) 10Marostegui: es1048: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1170503 (https://phabricator.wikimedia.org/T395771) [08:46:30] !log elukey@kafkamon2003:~$ sudo systemctl restart burrow-main-codfw.service [08:46:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:38] (03PS4) 10Ayounsi: Ganeti Bird BGP [puppet] - 10https://gerrit.wikimedia.org/r/1169662 
(https://phabricator.wikimedia.org/T362392) [08:47:14] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1169662 (https://phabricator.wikimedia.org/T362392) (owner: 10Ayounsi) [08:47:24] (03CR) 10Marostegui: [C:03+2] es1048: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1170503 (https://phabricator.wikimedia.org/T395771) (owner: 10Marostegui) [08:48:29] (03CR) 10Jaime Nuche: "Thanks for this Daniel!" [puppet] - 10https://gerrit.wikimedia.org/r/1137818 (https://phabricator.wikimedia.org/T377889) (owner: 10Dzahn) [08:48:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1189 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P79362 and previous config saved to /var/cache/conftool/dbconfig/20250718-084853-root.json [08:49:30] (03PS2) 10Arnaudb: miscweb: wikiworkshop use httpd [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170464 (https://phabricator.wikimedia.org/T398303) [08:49:30] (03CR) 10Arnaudb: "all tags have been checked and are pullable" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170464 (https://phabricator.wikimedia.org/T398303) (owner: 10Arnaudb) [08:49:47] 06SRE, 06Infrastructure-Foundations, 10netops: BGP: Support receipt of graceful-shutdown community and set local-pref - https://phabricator.wikimedia.org/T399931 (10cmooney) 03NEW p:05Triage→03Low [08:50:36] (03CR) 10CI reject: [V:04-1] DNM - test for ML hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1170463 (owner: 10Elukey) [08:54:01] (03PS1) 10Marostegui: instances.yaml: Add es1048 [puppet] - 10https://gerrit.wikimedia.org/r/1170504 (https://phabricator.wikimedia.org/T395771) [08:54:53] (03CR) 10Marostegui: [C:03+2] instances.yaml: Add es1048 [puppet] - 10https://gerrit.wikimedia.org/r/1170504 (https://phabricator.wikimedia.org/T395771) (owner: 10Marostegui) [08:54:53] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: eqsin purged consumers lag - 
https://phabricator.wikimedia.org/T399221#11016109 (10cmooney) All looks clean overnight with this, I have confirmed to Arelion they can close their ticket and we will re-open if the same thing happens ag... [08:55:02] 06SRE, 06Infrastructure-Foundations, 10netops: BGP: Support receipt of graceful-shutdown community and set local-pref - https://phabricator.wikimedia.org/T399931#11016110 (10ayounsi) Makes sens! [08:55:24] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [08:56:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Add es1048 to es7 depooled T395771', diff saved to https://phabricator.wikimedia.org/P79363 and previous config saved to /var/cache/conftool/dbconfig/20250718-085652-marostegui.json [08:56:57] T395771: Productionize es2047, es2048, es1047, es1048 - https://phabricator.wikimedia.org/T395771 [08:57:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T399249)', diff saved to https://phabricator.wikimedia.org/P79364 and previous config saved to /var/cache/conftool/dbconfig/20250718-085704-marostegui.json [08:57:08] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [08:57:19] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host sretest2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [08:57:19] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1239.eqiad.wmnet with reason: Maintenance [08:57:22] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [08:57:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Pool es1048 with 1% weight on es7 T395771', diff saved to https://phabricator.wikimedia.org/P79365 and previous config saved to 
/var/cache/conftool/dbconfig/20250718-085755-marostegui.json [09:03:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1189 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P79366 and previous config saved to /var/cache/conftool/dbconfig/20250718-090358-root.json [09:04:31] (03PS1) 10Brouberol: deployment_server: group chown airflow-wmde kubeconfig files to airflow-deployers [puppet] - 10https://gerrit.wikimedia.org/r/1170508 (https://phabricator.wikimedia.org/T399066) [09:04:58] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2009 - https://phabricator.wikimedia.org/T396365#11016136 (10elukey) @Jhancock.wm I managed to make provision working, the new settings are not yet merged so if you have other similar hosts ping me first :) The issue with the passwords/accounts is a... [09:04:59] (03CR) 10CI reject: [V:04-1] deployment_server: group chown airflow-wmde kubeconfig files to airflow-deployers [puppet] - 10https://gerrit.wikimedia.org/r/1170508 (https://phabricator.wikimedia.org/T399066) (owner: 10Brouberol) [09:05:50] (03PS2) 10Brouberol: deployment_server: group chown airflow-wmde kubeconfig files to airflow-deployers [puppet] - 10https://gerrit.wikimedia.org/r/1170508 (https://phabricator.wikimedia.org/T399066) [09:06:18] (03PS3) 10Brouberol: deployment_server: group chown airflow-wmde kubeconfigs to airflow-deployers [puppet] - 10https://gerrit.wikimedia.org/r/1170508 (https://phabricator.wikimedia.org/T399066) [09:06:18] (03CR) 10CI reject: [V:04-1] deployment_server: group chown airflow-wmde kubeconfigs to airflow-deployers [puppet] - 10https://gerrit.wikimedia.org/r/1170508 (https://phabricator.wikimedia.org/T399066) (owner: 10Brouberol) [09:08:37] (03CR) 10Btullis: [C:03+1] deployment_server: group chown airflow-wmde kubeconfigs to airflow-deployers [puppet] - 10https://gerrit.wikimedia.org/r/1170508 (https://phabricator.wikimedia.org/T399066) (owner: 10Brouberol) [09:11:35] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS 
(CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6303/co" [puppet] - 10https://gerrit.wikimedia.org/r/1170508 (https://phabricator.wikimedia.org/T399066) (owner: 10Brouberol) [09:15:37] (03CR) 10Arnaudb: "some questions inline. Otherwise lgtm, we'll have to be extra careful when we'll sunset `gerrit2`, this patch is another step in that dire" [puppet] - 10https://gerrit.wikimedia.org/r/1170433 (https://phabricator.wikimedia.org/T387833) (owner: 10Dzahn) [09:18:41] (03PS3) 10Arnaudb: miscweb: re-use httpd base image on miscweb [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170464 (https://phabricator.wikimedia.org/T398303) [09:19:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1189 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P79367 and previous config saved to /var/cache/conftool/dbconfig/20250718-091904-root.json [09:20:46] (03CR) 10Jelto: [C:03+1] "lgtm, thank you!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170464 (https://phabricator.wikimedia.org/T398303) (owner: 10Arnaudb) [09:23:52] (03CR) 10Ayounsi: "Nop, it was a typo, problem solved." 
[puppet] - 10https://gerrit.wikimedia.org/r/1169662 (https://phabricator.wikimedia.org/T362392) (owner: 10Ayounsi) [09:24:52] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:25:52] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:26:16] (03CR) 10Arnaudb: [C:03+2] miscweb: re-use httpd base image on miscweb [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170464 (https://phabricator.wikimedia.org/T398303) (owner: 10Arnaudb) [09:27:10] FIRING: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [09:28:47] (03Merged) 10jenkins-bot: miscweb: re-use httpd base image on miscweb [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170464 (https://phabricator.wikimedia.org/T398303) (owner: 10Arnaudb) [09:28:57] (03PS1) 10Cathal Mooney: BGP Policy: Set local-pref to zero on receipt of gshut community [homer/public] - 10https://gerrit.wikimedia.org/r/1170509 (https://phabricator.wikimedia.org/T399931) [09:30:00] (03PS2) 10Cathal Mooney: BGP Policy: Set local-pref to zero on receipt of gshut community [homer/public] - 10https://gerrit.wikimedia.org/r/1170509 (https://phabricator.wikimedia.org/T399931) [09:30:30] (03PS1) 10Btullis: Add the wikitech dump script [dumps] - 10https://gerrit.wikimedia.org/r/1170510 (https://phabricator.wikimedia.org/T398968) [09:30:54] (03PS5) 10Ayounsi: Ganeti Bird BGP [puppet] - 10https://gerrit.wikimedia.org/r/1169662 (https://phabricator.wikimedia.org/T362392) [09:32:10] RESOLVED: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [09:33:26] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: BGP: Support receipt of graceful-shutdown community and set local-pref - https://phabricator.wikimedia.org/T399931#11016186 (10cmooney) [09:34:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1189 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P79369 and previous config saved to /var/cache/conftool/dbconfig/20250718-093410-root.json [09:34:28] (03Abandoned) 10Cathal Mooney: Rename YAML var "evpn_bgp" to "switch_ibgp" [homer/public] - 10https://gerrit.wikimedia.org/r/1122208 (https://phabricator.wikimedia.org/T371088) (owner: 10Cathal Mooney) [09:35:02] !log arnaudb@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [09:36:41] !log arnaudb@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [09:36:55] Does anyone mind if I do a services (citoid) deploy? Weird spike in 503s in pyrra/thanos not due to any code change and I think I found the problem locally. [09:39:13] 10ops-codfw, 06SRE, 06DC-Ops: Arelion IC-374549 100G Transport outage (cr1-codfw -> cr1-eqiad) July 2025 - https://phabricator.wikimedia.org/T399097#11016195 (10cmooney) No update from Arelion, asked them to advise on the situation. 
[09:39:16] !log arnaudb@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [09:39:28] 10ops-codfw, 06SRE, 06DC-Ops: Arelion IC-374549 100G Transport outage (cr1-codfw -> cr1-eqiad) July 2025 - https://phabricator.wikimedia.org/T399097#11016196 (10cmooney) a:03cmooney [09:41:00] !log gmodena@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [09:41:06] !log gmodena@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [09:41:22] !log arnaudb@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [09:43:22] !log arnaudb@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply [09:45:16] !log arnaudb@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [09:46:25] FIRING: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [09:46:40] FIRING: [2x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [09:51:25] FIRING: [4x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [09:53:30] (03CR) 10Filippo Giunchedi: [C:03+1] prom/metamonitor: hide DeadManSwitch alerts in Karma [puppet] - 10https://gerrit.wikimedia.org/r/1170360 (https://phabricator.wikimedia.org/T397003) (owner: 10Tiziano Fogli) [09:56:25] RESOLVED: [4x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - 
https://alerts.wikimedia.org/?q=alertname%3DBFDdown [09:59:31] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1254.eqiad.wmnet with reason: Maintenance [09:59:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1254 (T399249)', diff saved to https://phabricator.wikimedia.org/P79370 and previous config saved to /var/cache/conftool/dbconfig/20250718-095938-marostegui.json [09:59:42] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [10:00:04] (03PS1) 10Stevemunene: dse-k8s: bootstrap dse-k8s-codefw cluster [puppet] - 10https://gerrit.wikimedia.org/r/1170514 (https://phabricator.wikimedia.org/T397293) [10:00:04] (03CR) 10Cathal Mooney: [C:03+1] "LGTM nice work!" [puppet] - 10https://gerrit.wikimedia.org/r/1169662 (https://phabricator.wikimedia.org/T362392) (owner: 10Ayounsi) [10:02:20] (03PS2) 10Stevemunene: dse-k8s: bootstrap dse-k8s-codfw cluster [puppet] - 10https://gerrit.wikimedia.org/r/1170514 (https://phabricator.wikimedia.org/T397293) [10:14:49] (03PS1) 10Jelto: miscweb: update miscweb images to new version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170515 (https://phabricator.wikimedia.org/T398303) [10:14:52] (03PS1) 10Jelto: miscweb: update miscweb design images to new version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170516 (https://phabricator.wikimedia.org/T398303) [10:24:07] (03CR) 10Mvolz: [C:03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170390 (owner: 10PipelineBot) [10:25:52] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170390 (owner: 10PipelineBot) [10:35:10] !log mvolz@deploy1003 helmfile [staging] START helmfile.d/services/citoid: apply [10:35:40] !log mvolz@deploy1003 helmfile [staging] DONE helmfile.d/services/citoid: apply [10:35:58] !log mvolz@deploy1003 helmfile [codfw] START 
helmfile.d/services/citoid: apply [10:36:23] !log mvolz@deploy1003 helmfile [codfw] DONE helmfile.d/services/citoid: apply [10:37:10] FIRING: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [10:37:13] !log mvolz@deploy1003 helmfile [eqiad] START helmfile.d/services/citoid: apply [10:37:39] !log mvolz@deploy1003 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [10:39:04] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs1022:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [10:39:35] 10ops-codfw, 06SRE, 06DC-Ops: Arelion IC-374549 100G Transport outage (cr1-codfw -> cr1-eqiad) July 2025 - https://phabricator.wikimedia.org/T399097#11016357 (10cmooney) ` 2025-07-18 10:24 Apologies for the inconveniences, Please be informed that investigation is ongoing with our senior engineer and be res... [10:42:10] FIRING: [2x] BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [10:47:10] FIRING: [3x] BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [10:51:22] (03CR) 10Btullis: [C:03+1] "Looks good to me." 
[dns] - 10https://gerrit.wikimedia.org/r/1170364 (https://phabricator.wikimedia.org/T397293) (owner: 10Stevemunene) [10:51:45] (03CR) 10Btullis: [C:03+1] dse-k8s: bootstrap dse-k8s-codfw cluster [puppet] - 10https://gerrit.wikimedia.org/r/1170514 (https://phabricator.wikimedia.org/T397293) (owner: 10Stevemunene) [10:52:10] RESOLVED: [3x] BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [10:54:12] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:55:53] (03PS3) 10Cathal Mooney: BGP Policy: Set local-pref to zero on receipt of gshut community [homer/public] - 10https://gerrit.wikimedia.org/r/1170509 (https://phabricator.wikimedia.org/T399931) [10:56:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [11:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250718T0700) [11:00:05] jelto, arnoldokoth, and mutante: GitLab version upgrades (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250718T1100). Please do the needful. 
[11:00:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1254 (T399249)', diff saved to https://phabricator.wikimedia.org/P79371 and previous config saved to /var/cache/conftool/dbconfig/20250718-110033-marostegui.json [11:00:38] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [11:03:15] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [11:03:30] (03CR) 10Cathal Mooney: "Hey Sukhbir thanks for checking. Let's hold off for now I need to review it again, I believe this only covers the case when no '--generat" [dns] - 10https://gerrit.wikimedia.org/r/1164124 (https://phabricator.wikimedia.org/T362985) (owner: 10Slyngshede) [11:07:40] FIRING: [2x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [11:07:55] RESOLVED: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [11:09:25] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [11:10:59] (03CR) 10Stevemunene: [C:03+2] dns: Add dse-k8s codfw SRV records [dns] - 10https://gerrit.wikimedia.org/r/1170364 (https://phabricator.wikimedia.org/T397293) (owner: 10Stevemunene) [11:13:04] !log stevemunene@dns1004 START - 
running authdns-update [11:14:09] !log stevemunene@dns1004 END - running authdns-update [11:15:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P79372 and previous config saved to /var/cache/conftool/dbconfig/20250718-111541-marostegui.json [11:23:01] (03CR) 10Stevemunene: [C:03+2] dse-k8s: bootstrap dse-k8s-codfw cluster [puppet] - 10https://gerrit.wikimedia.org/r/1170514 (https://phabricator.wikimedia.org/T397293) (owner: 10Stevemunene) [11:27:18] (03PS6) 10Cathal Mooney: Capirca: handle script having no 'status' attribute gracefully [software/homer] - 10https://gerrit.wikimedia.org/r/1166373 [11:30:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P79373 and previous config saved to /var/cache/conftool/dbconfig/20250718-113048-marostegui.json [11:35:10] FIRING: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [11:38:31] (03CR) 10CI reject: [V:04-1] Capirca: handle script having no 'status' attribute gracefully [software/homer] - 10https://gerrit.wikimedia.org/r/1166373 (owner: 10Cathal Mooney) [11:39:31] (03CR) 10Ilias Sarantopoulos: [C:03+1] "LGTM!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1170447 (https://phabricator.wikimedia.org/T363336) (owner: 10Kevin Bazira) [11:39:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1048 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P79374 and previous config saved to /var/cache/conftool/dbconfig/20250718-113933-root.json [11:40:10] RESOLVED: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [11:41:10] FIRING: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [11:42:43] !log marostegui@cumin1002 START - Cookbook sre.mysql.parsercache [11:42:51] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [11:43:22] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance [11:43:24] !log Restart pc7 T399540 [11:43:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:28] T399540: Upgrade masters to 10.6.22 and 10.11.13 .2 update - https://phabricator.wikimedia.org/T399540 [11:45:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1254 (T399249)', diff saved to https://phabricator.wikimedia.org/P79376 and previous config saved to /var/cache/conftool/dbconfig/20250718-114555-marostegui.json [11:46:00] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 
[11:46:10] FIRING: [4x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [11:46:11] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1259.eqiad.wmnet with reason: Maintenance [11:46:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1259 (T399249)', diff saved to https://phabricator.wikimedia.org/P79377 and previous config saved to /var/cache/conftool/dbconfig/20250718-114618-marostegui.json [11:48:01] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [11:49:20] !log marostegui@cumin1002 START - Cookbook sre.mysql.parsercache [11:49:45] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [11:49:57] FIRING: ProbeDown: Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:50:25] here [11:50:31] here as well [11:50:36] (03PS1) 10Cathal Mooney: sre.hosts.decommision: remove virtual interfaces from during decom [cookbooks] - 10https://gerrit.wikimedia.org/r/1170530 (https://phabricator.wikimedia.org/T398412) [11:50:55] I acked it, looking if there was maintenance or something [11:51:10] RESOLVED: [4x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [11:51:18] eqsin is not pooled, right? 
[11:51:27] it was repooled last night [11:51:31] oh [11:51:59] it is failing on codfw too [11:54:20] I do see the following on the logs [11:54:23] 2025/07/18 11:53:26 [alert] 3856385#3856385: 768 worker_connections are not enough [11:54:31] this on 5001 [11:54:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P79379 and previous config saved to /var/cache/conftool/dbconfig/20250718-115440-root.json [11:54:57] RESOLVED: ProbeDown: Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:55:17] what [11:56:26] its the different dcs, I think [11:56:54] ah, sorry, I didn't see it was a resolution [11:56:59] (03CR) 10CI reject: [V:04-1] sre.hosts.decommision: remove virtual interfaces from during decom [cookbooks] - 10https://gerrit.wikimedia.org/r/1170530 (https://phabricator.wikimedia.org/T398412) (owner: 10Cathal Mooney) [12:01:13] (03CR) 10Jelto: [C:03+2] miscweb: update miscweb design images to new version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170516 (https://phabricator.wikimedia.org/T398303) (owner: 10Jelto) [12:01:16] (03CR) 10Jelto: [C:03+2] miscweb: update miscweb images to new version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170515 (https://phabricator.wikimedia.org/T398303) (owner: 10Jelto) [12:01:27] 10ops-codfw, 06DC-Ops: Unresponsive management for cirrussearch2089.mgmt:22 - https://phabricator.wikimedia.org/T399943 (10phaultfinder) 03NEW [12:01:36] (03PS2) 10Cathal Mooney: sre.hosts.decommision: remove virtual interfaces from during decom [cookbooks] - 10https://gerrit.wikimedia.org/r/1170530 (https://phabricator.wikimedia.org/T398412) [12:03:12] (03Merged) 10jenkins-bot: miscweb: update 
miscweb images to new version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170515 (https://phabricator.wikimedia.org/T398303) (owner: 10Jelto) [12:03:18] (03Merged) 10jenkins-bot: miscweb: update miscweb design images to new version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170516 (https://phabricator.wikimedia.org/T398303) (owner: 10Jelto) [12:03:51] (03CR) 10Arnaudb: [C:03+1] miscweb: update miscweb images to new version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170515 (https://phabricator.wikimedia.org/T398303) (owner: 10Jelto) [12:04:04] (03CR) 10Arnaudb: [C:03+1] miscweb: update miscweb design images to new version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170516 (https://phabricator.wikimedia.org/T398303) (owner: 10Jelto) [12:05:59] !log jelto@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [12:07:12] !log jelto@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [12:08:10] FIRING: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:08:10] (03CR) 10CI reject: [V:04-1] sre.hosts.decommision: remove virtual interfaces from during decom [cookbooks] - 10https://gerrit.wikimedia.org/r/1170530 (https://phabricator.wikimedia.org/T398412) (owner: 10Cathal Mooney) [12:08:23] !log jelto@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [12:09:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P79380 and previous config saved to /var/cache/conftool/dbconfig/20250718-120946-root.json [12:09:57] !log jelto@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [12:10:48] !log jelto@deploy1003 
helmfile [eqiad] START helmfile.d/services/miscweb: apply [12:12:31] !log jelto@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [12:13:10] RESOLVED: [2x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:13:22] (03CR) 10DDesouza: [V:03+1 C:03+1] miscweb: update miscweb design images to new version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170516 (https://phabricator.wikimedia.org/T398303) (owner: 10Jelto) [12:17:02] (03PS1) 10Marostegui: db1198: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1170534 (https://phabricator.wikimedia.org/T399548) [12:18:11] (03CR) 10Marostegui: [C:03+2] db1198: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1170534 (https://phabricator.wikimedia.org/T399548) (owner: 10Marostegui) [12:18:58] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1198.eqiad.wmnet with reason: Maintenance [12:19:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1198 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P79381 and previous config saved to /var/cache/conftool/dbconfig/20250718-121901-marostegui.json [12:22:29] (03CR) 10Tiziano Fogli: [C:03+2] prom/metamonitor: hide DeadManSwitch alerts in Karma (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1170360 (https://phabricator.wikimedia.org/T397003) (owner: 10Tiziano Fogli) [12:23:25] (03PS1) 10Vgutierrez: hiera: Disable paging for ncredir-https [puppet] - 10https://gerrit.wikimedia.org/r/1170536 [12:24:04] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Eqiad: row C/D switch refresh - https://phabricator.wikimedia.org/T396063#11016669 (10Jclark-ctr) Adjusted 
Mgmt in Rack D1 , D8 down to place spines in top of rack. Updated netbox and installed Rails [12:24:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1048 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P79382 and previous config saved to /var/cache/conftool/dbconfig/20250718-122452-root.json [12:26:59] (03CR) 10Jcrespo: [C:03+1] hiera: Disable paging for ncredir-https [puppet] - 10https://gerrit.wikimedia.org/r/1170536 (owner: 10Vgutierrez) [12:27:27] (03CR) 10Vgutierrez: [C:03+2] hiera: Disable paging for ncredir-https [puppet] - 10https://gerrit.wikimedia.org/r/1170536 (owner: 10Vgutierrez) [12:29:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1198 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P79383 and previous config saved to /var/cache/conftool/dbconfig/20250718-122914-root.json [12:34:25] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-esams:xe-0/1/2 (inter.link reserved port) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [12:35:11] (03PS1) 10Tiziano Fogli: prom/metamonitor: fix typo on karma erb config file [puppet] - 10https://gerrit.wikimedia.org/r/1170540 (https://phabricator.wikimedia.org/T397003) [12:35:23] (03CR) 10Tiziano Fogli: [C:03+2] prom/metamonitor: fix typo on karma erb config file [puppet] - 10https://gerrit.wikimedia.org/r/1170540 (https://phabricator.wikimedia.org/T397003) (owner: 10Tiziano Fogli) [12:35:35] (03CR) 10Tiziano Fogli: [V:03+2 C:03+2] prom/metamonitor: fix typo on karma erb config file [puppet] - 10https://gerrit.wikimedia.org/r/1170540 (https://phabricator.wikimedia.org/T397003) (owner: 10Tiziano Fogli) [12:35:57] (03PS1) 10Marostegui: installserver: Do not format es1048 [puppet] - 
10https://gerrit.wikimedia.org/r/1170541 [12:39:20] (03PS1) 10Cathal Mooney: cephosd: un-set bird bgp neighbors rather than override for each host [puppet] - 10https://gerrit.wikimedia.org/r/1170543 [12:39:50] (03CR) 10Marostegui: [C:03+2] installserver: Do not format es1048 [puppet] - 10https://gerrit.wikimedia.org/r/1170541 (owner: 10Marostegui) [12:39:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1048 (re)pooling @ 35%: Repooling', diff saved to https://phabricator.wikimedia.org/P79384 and previous config saved to /var/cache/conftool/dbconfig/20250718-123958-root.json [12:40:25] (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1170543 (owner: 10Cathal Mooney) [12:41:53] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on backup1007 - https://phabricator.wikimedia.org/T399847#11016724 (10Jclark-ctr) Replaced Failed drive powering up now [12:42:56] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on backup1007 - https://phabricator.wikimedia.org/T399847#11016725 (10jcrespo) Thank you! [12:44:10] (03CR) 10Ayounsi: [C:03+1] "nice lgtm!" 
[homer/public] - 10https://gerrit.wikimedia.org/r/1170509 (https://phabricator.wikimedia.org/T399931) (owner: 10Cathal Mooney) [12:44:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1198 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P79385 and previous config saved to /var/cache/conftool/dbconfig/20250718-124419-root.json [12:44:35] (03PS1) 10Jelto: miscweb: update miscweb images to new version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170544 (https://phabricator.wikimedia.org/T398303) [12:45:11] ACKNOWLEDGEMENT - MegaRAID on backup1007 is CRITICAL: CRITICAL: 1 failed LD(s) (Partially Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T399948 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [12:45:15] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on backup1007 - https://phabricator.wikimedia.org/T399948 (10ops-monitoring-bot) 03NEW [12:45:42] (03PS1) 10Tiziano Fogli: prom/metamonitor: fix indentation on karma erb config file [puppet] - 10https://gerrit.wikimedia.org/r/1170545 (https://phabricator.wikimedia.org/T397003) [12:46:30] (03CR) 10Tiziano Fogli: [C:03+2] prom/metamonitor: fix indentation on karma erb config file [puppet] - 10https://gerrit.wikimedia.org/r/1170545 (https://phabricator.wikimedia.org/T397003) (owner: 10Tiziano Fogli) [12:48:15] (03CR) 10Ayounsi: "overall lgtm, I'd suggest to test it on a sretest hosts on the prod instance with test-cookbook." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1170530 (https://phabricator.wikimedia.org/T398412) (owner: 10Cathal Mooney) [12:49:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1259 (T399249)', diff saved to https://phabricator.wikimedia.org/P79386 and previous config saved to /var/cache/conftool/dbconfig/20250718-124901-marostegui.json [12:49:06] (03CR) 10Cathal Mooney: "OK yep, I guess I can add some virtual ints and stuff to sretest or mess with it without risking too much - good idea!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1170530 (https://phabricator.wikimedia.org/T398412) (owner: 10Cathal Mooney) [12:49:08] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [12:49:38] (03CR) 10Cathal Mooney: "If you've any idea about the CI error I'm all ears. Too many branches but I don't see an easy way to avoid it here." [cookbooks] - 10https://gerrit.wikimedia.org/r/1170530 (https://phabricator.wikimedia.org/T398412) (owner: 10Cathal Mooney) [12:51:45] (03CR) 10Jelto: [C:03+2] miscweb: update miscweb images to new version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170544 (https://phabricator.wikimedia.org/T398303) (owner: 10Jelto) [12:53:57] (03Merged) 10jenkins-bot: miscweb: update miscweb images to new version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170544 (https://phabricator.wikimedia.org/T398303) (owner: 10Jelto) [12:55:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1048 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P79387 and previous config saved to /var/cache/conftool/dbconfig/20250718-125504-root.json [12:55:08] !log jelto@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [12:55:25] !log jelto@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [12:56:00] !log jelto@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [12:56:01] (03PS2) 
10Cathal Mooney: cephosd: un-set bird bgp neighbors rather than override for each host [puppet] - 10https://gerrit.wikimedia.org/r/1170543 [12:56:22] !log jelto@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [12:56:52] !log jelto@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply [12:57:05] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on backup1007 - https://phabricator.wikimedia.org/T399847#11016776 (10jcrespo) I'm afraid the new disk has not been detected: {F65180194} (it is not out of order, either) We are still running in degraded mode (with 1 less disk). [12:57:12] !log jelto@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [12:57:38] (03PS1) 10C. Scott Ananian: Enable the "Report Visual Bug" feature of Extension:ParserMigration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170549 (https://phabricator.wikimedia.org/T365371) [12:57:57] (03PS1) 10Elukey: pyrra: simplify multi-dc handling for istio SLOs [puppet] - 10https://gerrit.wikimedia.org/r/1170550 (https://phabricator.wikimedia.org/T398534) [12:58:17] (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1170543 (owner: 10Cathal Mooney) [12:58:17] !log jelto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply [12:58:26] (03CR) 10CI reject: [V:04-1] Enable the "Report Visual Bug" feature of Extension:ParserMigration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170549 (https://phabricator.wikimedia.org/T365371) (owner: 10C. 
Scott Ananian) [12:58:58] !log jelto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply [12:59:03] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6306/co" [puppet] - 10https://gerrit.wikimedia.org/r/1170550 (https://phabricator.wikimedia.org/T398534) (owner: 10Elukey) [12:59:06] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply [12:59:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1198 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P79388 and previous config saved to /var/cache/conftool/dbconfig/20250718-125925-root.json [12:59:40] !log jelto@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply [13:00:43] (03CR) 10Vgutierrez: [C:03+1] pyrra: simplify multi-dc handling for istio SLOs [puppet] - 10https://gerrit.wikimedia.org/r/1170550 (https://phabricator.wikimedia.org/T398534) (owner: 10Elukey) [13:02:11] (03CR) 10Ayounsi: "`# pylint disable=too-many-branches` :) If volans is ok of course. Otherwise we would need to refactor and split some processing in their " [cookbooks] - 10https://gerrit.wikimedia.org/r/1170530 (https://phabricator.wikimedia.org/T398412) (owner: 10Cathal Mooney) [13:02:16] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [13:02:18] (03CR) 10Elukey: [V:03+1 C:03+2] pyrra: simplify multi-dc handling for istio SLOs [puppet] - 10https://gerrit.wikimedia.org/r/1170550 (https://phabricator.wikimedia.org/T398534) (owner: 10Elukey) [13:02:30] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[13:04:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1259', diff saved to https://phabricator.wikimedia.org/P79389 and previous config saved to /var/cache/conftool/dbconfig/20250718-130410-marostegui.json [13:04:18] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on backup1007 - https://phabricator.wikimedia.org/T399847#11016820 (10jcrespo) [13:04:22] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on backup1007 - https://phabricator.wikimedia.org/T399948#11016822 (10jcrespo) →14Duplicate dup:03T399847 [13:05:07] !log gmodena@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [13:05:21] !log gmodena@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [13:05:56] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [13:07:08] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [13:07:50] (03PS1) 10Marostegui: db1212: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1170551 (https://phabricator.wikimedia.org/T399548) [13:08:22] (03CR) 10Ssingh: [C:03+1] acme_chief: Delete empty directories after pruning expired certs [puppet] - 10https://gerrit.wikimedia.org/r/1170497 (https://phabricator.wikimedia.org/T399419) (owner: 10Vgutierrez) [13:09:40] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [13:09:56] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[13:10:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1048 (re)pooling @ 65%: Repooling', diff saved to https://phabricator.wikimedia.org/P79390 and previous config saved to /var/cache/conftool/dbconfig/20250718-131009-root.json [13:11:05] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users group (LDAP and kerberos), for aprum - https://phabricator.wikimedia.org/T398650#11016853 (10ssingh) @aranyap: It seems like the group membership has been updated. Can you please try again? Thanks! [13:12:19] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [13:12:21] !log gmodena@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [13:12:26] !log gmodena@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [13:12:30] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [13:14:13] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [13:14:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1198 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P79391 and previous config saved to /var/cache/conftool/dbconfig/20250718-131431-root.json [13:15:03] (03CR) 10Marostegui: [C:03+2] db1212: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1170551 (https://phabricator.wikimedia.org/T399548) (owner: 10Marostegui) [13:15:06] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on 10 hosts with reason: Maintenance [13:15:25] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[13:15:51] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1212.eqiad.wmnet with reason: Maintenance [13:15:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1212 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P79392 and previous config saved to /var/cache/conftool/dbconfig/20250718-131554-marostegui.json [13:16:34] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [13:17:02] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [13:17:56] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [13:18:10] FIRING: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:19:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1259', diff saved to https://phabricator.wikimedia.org/P79393 and previous config saved to /var/cache/conftool/dbconfig/20250718-131917-marostegui.json [13:21:32] 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 10LDAP-Access-Requests: Offboard Noarave from WMF systems - https://phabricator.wikimedia.org/T399953 (10karapayneWMDE) 03NEW [13:22:51] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on backup1007 - https://phabricator.wikimedia.org/T399847#11016884 (10jcrespo) I will put the server back into service so the service is not down during during the weekend and figure out a way to resolve this next week. 
[13:23:10] RESOLVED: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:23:45] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on backup1007.eqiad.wmnet with reason: failed disk [13:23:56] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on backup1007 - https://phabricator.wikimedia.org/T399847#11016885 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=89313787-9150-425e-afd0-6f2bea491334) set by jynus@cumin1003 for 3 days, 0:00:00 on 1 host(s) and their services with reason: fail... [13:24:16] (03CR) 10Btullis: [C:03+2] Add the wikitech dump script [dumps] - 10https://gerrit.wikimedia.org/r/1170510 (https://phabricator.wikimedia.org/T398968) (owner: 10Btullis) [13:24:40] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1189 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/1170554 (https://phabricator.wikimedia.org/T399954) [13:24:45] (03PS1) 10Gerrit maintenance bot: wmnet: Update s3-master alias [dns] - 10https://gerrit.wikimedia.org/r/1170555 (https://phabricator.wikimedia.org/T399954) [13:25:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P79394 and previous config saved to /var/cache/conftool/dbconfig/20250718-132515-root.json [13:26:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1212 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P79395 and previous config saved to /var/cache/conftool/dbconfig/20250718-132638-root.json [13:27:46] (03CR) 10Ayounsi: "recheck" [software/homer] - 10https://gerrit.wikimedia.org/r/1166373 (owner: 10Cathal Mooney) [13:28:52] (03CR) 10DDesouza: [C:03+1] miscweb: update 
miscweb images to new version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170544 (https://phabricator.wikimedia.org/T398303) (owner: 10Jelto) [13:29:36] (03CR) 10Cathal Mooney: [C:03+2] BGP Policy: Set local-pref to zero on receipt of gshut community [homer/public] - 10https://gerrit.wikimedia.org/r/1170509 (https://phabricator.wikimedia.org/T399931) (owner: 10Cathal Mooney) [13:30:09] (03Merged) 10jenkins-bot: BGP Policy: Set local-pref to zero on receipt of gshut community [homer/public] - 10https://gerrit.wikimedia.org/r/1170509 (https://phabricator.wikimedia.org/T399931) (owner: 10Cathal Mooney) [13:30:43] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host ml-serve1012.eqiad.wmnet with OS bookworm [13:34:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1259 (T399249)', diff saved to https://phabricator.wikimedia.org/P79396 and previous config saved to /var/cache/conftool/dbconfig/20250718-133424-marostegui.json [13:34:31] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [13:35:26] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2149.codfw.wmnet with reason: Maintenance [13:35:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2149 (T399249)', diff saved to https://phabricator.wikimedia.org/P79397 and previous config saved to /var/cache/conftool/dbconfig/20250718-133533-marostegui.json [13:35:40] FIRING: [4x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:37:01] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ml-serve1012.eqiad.wmnet with OS bookworm [13:37:39] (03PS1) 10Marostegui: db2242: Fix section [puppet] - 10https://gerrit.wikimedia.org/r/1170558 [13:38:10] (03CR) 10CI reject: [V:04-1] Capirca: 
handle script having no 'status' attribute gracefully [software/homer] - 10https://gerrit.wikimedia.org/r/1166373 (owner: 10Cathal Mooney) [13:38:12] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2242.codfw.wmnet with reason: Maintenance [13:38:15] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [13:39:12] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2242.codfw.wmnet [13:39:21] !log marostegui@cumin1002 START - Cookbook sre.mysql.depool db2242 - Upgrading db2242.codfw.wmnet [13:39:36] (03CR) 10Marostegui: [C:03+2] db2242: Fix section [puppet] - 10https://gerrit.wikimedia.org/r/1170558 (owner: 10Marostegui) [13:39:50] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2242 - Upgrading db2242.codfw.wmnet [13:40:07] (03CR) 10Marostegui: [C:03+2] "It was all good in zarcillo" [puppet] - 10https://gerrit.wikimedia.org/r/1170558 (owner: 10Marostegui) [13:40:19] (03CR) 10Vgutierrez: [C:03+2] acme_chief: Delete empty directories after pruning expired certs [puppet] - 10https://gerrit.wikimedia.org/r/1170497 (https://phabricator.wikimedia.org/T399419) (owner: 10Vgutierrez) [13:40:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P79399 and previous config saved to /var/cache/conftool/dbconfig/20250718-134021-root.json [13:40:40] RESOLVED: [3x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:41:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1212 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P79400 and 
previous config saved to /var/cache/conftool/dbconfig/20250718-134144-root.json [13:42:16] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#11016986 (10ayounsi) From the network side it does indeed try to fetch the URL through TFTP... ` install1004:~$ sudo tcpdump host 10.64.159.5 tcpdump: verbose output... [13:45:23] !log marostegui@cumin1002 START - Cookbook sre.mysql.pool db2242 gradually with 4 steps - Upgrade of db2242.codfw.wmnet completed [13:45:40] FIRING: [4x] BFDdown: BFD session down between cr1-codfw and 208.80.153.220 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:49:43] (03PS1) 10Elukey: pyrra: fix Istio latency metric config with latency_target_requests_regex [puppet] - 10https://gerrit.wikimedia.org/r/1170564 (https://phabricator.wikimedia.org/T390706) [13:50:40] RESOLVED: [3x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:50:59] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6315/co" [puppet] - 10https://gerrit.wikimedia.org/r/1170564 (https://phabricator.wikimedia.org/T390706) (owner: 10Elukey) [13:54:48] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for resquito - https://phabricator.wikimedia.org/T399899#11017038 (10ssingh) Hi @REsquito-WMF: I am trying to understand if analytics-privatedata-users is really required for this. Can you clarify the reason for your access a bit mo... 
[13:55:26] (03CR) 10Vgutierrez: [C:03+1] pyrra: fix Istio latency metric config with latency_target_requests_regex [puppet] - 10https://gerrit.wikimedia.org/r/1170564 (https://phabricator.wikimedia.org/T390706) (owner: 10Elukey) [13:55:33] (03CR) 10Elukey: [V:03+1 C:03+2] pyrra: fix Istio latency metric config with latency_target_requests_regex [puppet] - 10https://gerrit.wikimedia.org/r/1170564 (https://phabricator.wikimedia.org/T390706) (owner: 10Elukey) [13:56:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1212 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P79402 and previous config saved to /var/cache/conftool/dbconfig/20250718-135650-root.json [14:02:29] !log Running `foreachwiki AbuseFilter:PopulateAbuseFilterLogIPHex.php --batch-size 1000 --sleep 1` for T397842 [14:02:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:33] T397842: Populate afl_ip_hex for pre-existing abuse_filter_log rows - https://phabricator.wikimedia.org/T397842 [14:02:58] (03PS9) 10Daimona Eaytoy: Move special wikis outside of the 'wikipedia' group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167880 (https://phabricator.wikimedia.org/T183549) [14:04:48] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for resquito - https://phabricator.wikimedia.org/T399899#11017063 (10ssingh) [14:05:20] !log Stopped the previous command [14:05:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:33] !log Running `foreachwiki AbuseFilter:PopulateAbuseFilterLogIPHex.php` for T397842 [14:05:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:05:57] !log fceratto@deploy1003 
helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [14:06:36] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for resquito - https://phabricator.wikimedia.org/T399899#11017065 (10REsquito-WMF) HI I will need acess to data lake, hive, and others. Also Adam Baso just mentioned to me that I missing wmf group. [14:06:59] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 21 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167900 (https://phabricator.wikimedia.org/T183549) (owner: 10Jforrester) [14:07:09] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission sretest2007/sretest2008 - https://phabricator.wikimedia.org/T399447#11017066 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [14:07:20] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 21 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167880 (https://phabricator.wikimedia.org/T183549) (owner: 10Daimona Eaytoy) [14:07:30] (03PS5) 10Daimona Eaytoy: Add a test to verify that "normal" DBLists contain only SUL wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167890 (https://phabricator.wikimedia.org/T183549) [14:08:21] (03CR) 10Jforrester: "Thanks for deploying, I got busy yesterday and ran out of time!" 
[extensions/FlaggedRevs] (wmf/1.45.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1170318 (https://phabricator.wikimedia.org/T399641) (owner: 10Jforrester) [14:11:29] (03CR) 10Brouberol: [V:03+1 C:03+2] deployment_server: group chown airflow-wmde kubeconfigs to airflow-deployers [puppet] - 10https://gerrit.wikimedia.org/r/1170508 (https://phabricator.wikimedia.org/T399066) (owner: 10Brouberol) [14:11:49] (03CR) 10Jforrester: [C:03+1] Move special wikis outside of the 'wikipedia' group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167880 (https://phabricator.wikimedia.org/T183549) (owner: 10Daimona Eaytoy) [14:11:49] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 21 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167890 (https://phabricator.wikimedia.org/T183549) (owner: 10Daimona Eaytoy) [14:11:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1212 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P79404 and previous config saved to /var/cache/conftool/dbconfig/20250718-141156-root.json [14:12:52] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for resquito - https://phabricator.wikimedia.org/T399899#11017086 (10ssingh) >>! In T399899#11017065, @REsquito-WMF wrote: > HI > > I will need acess to data lake, hive, and others. > > Also Adam Baso just mentioned to me that I m... 
[14:12:55] (03PS2) 10Daimona Eaytoy: Clean up some settings for special wikis no longer in wikipedia group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1168169 (https://phabricator.wikimedia.org/T183549) [14:13:19] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 21 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1168169 (https://phabricator.wikimedia.org/T183549) (owner: 10Daimona Eaytoy) [14:13:21] (03PS1) 10Jforrester: Clean up wmgWikibaseSiteGroup list, alpha-sort and de-dupe [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170565 [14:13:25] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:13:58] (03PS10) 10Elukey: WIP: sre.hosts.provision: add custom settings for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1170085 (https://phabricator.wikimedia.org/T394357) [14:13:58] (03PS5) 10Elukey: DNM - test for ML hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1170463 [14:14:23] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 21 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167941 (owner: 10Daimona Eaytoy) [14:14:42] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:14:50] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:15:10] FIRING: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:18:25] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:19:04] (03PS11) 10Elukey: WIP: sre.hosts.provision: add custom settings for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1170085 (https://phabricator.wikimedia.org/T394357) [14:19:04] (03PS6) 10Elukey: DNM - test for ML hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1170463 [14:20:09] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:20:10] FIRING: [2x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:21:03] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [14:21:11] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#11017119 (10elukey) Due to a bug in my provisioning-changes I was missing these: ` BIOS: IPv4HTTPSupport is set to Disabled, while we want Enabled BIOS: IPv4PXESupport... 
[14:22:34] (03PS1) 10Fabfur: haproxy: this commit deliberately contains a syntax error in haproxy [puppet] - 10https://gerrit.wikimedia.org/r/1170567 [14:23:02] (03PS2) 10Fabfur: haproxy: this commit deliberately contains a syntax error in haproxy [puppet] - 10https://gerrit.wikimedia.org/r/1170567 [14:25:10] RESOLVED: [2x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:25:27] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:25:47] (03CR) 10CI reject: [V:04-1] DNM - test for ML hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1170463 (owner: 10Elukey) [14:25:48] (03CR) 10CI reject: [V:04-1] WIP: sre.hosts.provision: add custom settings for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1170085 (https://phabricator.wikimedia.org/T394357) (owner: 10Elukey) [14:30:40] FIRING: [3x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:30:50] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2242 gradually with 4 steps - Upgrade of db2242.codfw.wmnet completed [14:30:50] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2242.codfw.wmnet [14:33:32] (03PS1) 10Scott French: httpd: clean up transitional -bookworm track [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1170405 (https://phabricator.wikimedia.org/T378128) [14:33:44] (03PS1) 10Ayounsi: WIP: Bird: VM side - add support for Routed Ganeti [puppet] - 10https://gerrit.wikimedia.org/r/1170570 (https://phabricator.wikimedia.org/T362392) [14:35:13] (03CR) 10Scott French: [V:03+2] "No longer processed by docker-pkg." 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1170405 (https://phabricator.wikimedia.org/T378128) (owner: 10Scott French) [14:35:40] RESOLVED: [3x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:38:30] (03PS2) 10Ayounsi: WIP: Bird: VM side - add support for Routed Ganeti [puppet] - 10https://gerrit.wikimedia.org/r/1170570 (https://phabricator.wikimedia.org/T362392) [14:39:04] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs1022:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [14:39:24] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1170570 (https://phabricator.wikimedia.org/T362392) (owner: 10Ayounsi) [14:43:23] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for resquito - https://phabricator.wikimedia.org/T399899#11017204 (10dr0ptp4kt) Thanks @ssingh - I'm wondering, should we create a subheading between https://wikitech.wikimedia.org/wiki/SRE/Production_access#Generating_your_SSH_key... [14:44:24] (03CR) 10Effie Mouzeli: [C:03+1] httpd: clean up transitional -bookworm track [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1170405 (https://phabricator.wikimedia.org/T378128) (owner: 10Scott French) [14:44:52] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users group (LDAP and kerberos), for aprum - https://phabricator.wikimedia.org/T398650#11017207 (10dr0ptp4kt) Just for visibility here as these are around the same time: thread over at T399899#11017204 that's related, heads up. 
[14:47:14] 06SRE, 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 10LDAP-Access-Requests: Offboard Noarave from WMF systems - https://phabricator.wikimedia.org/T399953#11017210 (10ssingh) a:03joanna_borun [14:51:39] 06SRE-OnFire, 10Cloud-VPS, 10cloud-services-team (FY2025/26-Q1), 10Sustainability (Incident Followup): Cloud Ceph misbehaving on Debian Bookworm - https://phabricator.wikimedia.org/T399858#11017233 (10fnegri) a:05fnegri→03Andrew The growth rate is slowing, but it's not flatlining as I hoped... So the s... [14:53:59] (03PS1) 10Ssingh: admin: add resquito to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1170571 (https://phabricator.wikimedia.org/T399899) [14:54:12] (03PS2) 10Ssingh: admin: add resquito to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1170571 (https://phabricator.wikimedia.org/T399899) [14:54:13] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:54:56] (03CR) 10CI reject: [V:04-1] admin: add resquito to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1170571 (https://phabricator.wikimedia.org/T399899) (owner: 10Ssingh) [14:55:14] (03PS3) 10Ssingh: admin: add resquito to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1170571 (https://phabricator.wikimedia.org/T399899) [15:00:08] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for resquito - https://phabricator.wikimedia.org/T399899#11017246 (10dr0ptp4kt) >>! In T399899#11017038, @ssingh wrote: > Hi @REsquito-WMF: I am trying to understand if analytics-privatedata-users is really req... 
[15:00:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:01] (03PS1) 10Fabfur: haproxy: script to perform configuration validation [puppet] - 10https://gerrit.wikimedia.org/r/1170572 (https://phabricator.wikimedia.org/T399941) [15:09:25] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [15:09:26] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for resquito - https://phabricator.wikimedia.org/T399899#11017262 (10ssingh) Hi @dr0ptp4kt: >>! In T399899#11017204, @dr0ptp4kt wrote: > Thanks @ssingh - I'm wondering, should we create a subheading between ht... [15:09:29] (03CR) 10CI reject: [V:04-1] haproxy: script to perform configuration validation [puppet] - 10https://gerrit.wikimedia.org/r/1170572 (https://phabricator.wikimedia.org/T399941) (owner: 10Fabfur) [15:14:09] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#11017270 (10elukey) @Jclark-ctr for some reason ml-serve1012 seems stuck, I am not able to powercycle it from the mgmt console. Would you mind to hard reset it when you... 
[15:15:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr1-magru and Ufinet (187.108.235.25) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [15:17:38] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for resquito - https://phabricator.wikimedia.org/T399899#11017283 (10dr0ptp4kt) Thanks @ssingh ! I think it's probably just a matter of updating the pages. I've had my access for a good while now, and I bet the... [15:18:44] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for resquito - https://phabricator.wikimedia.org/T399899#11017284 (10ssingh) >>! In T399899#11017246, @dr0ptp4kt wrote: >>>! In T399899#11017038, @ssingh wrote: >> Hi @REsquito-WMF: I am trying to understand if... [15:20:53] (03PS2) 10Fabfur: haproxy: script to perform configuration validation [puppet] - 10https://gerrit.wikimedia.org/r/1170572 (https://phabricator.wikimedia.org/T399941) [15:21:39] 10ops-eqiad, 06Data-Platform-SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964 (10RobH) 03NEW [15:21:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:21:58] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1170572 (https://phabricator.wikimedia.org/T399941) (owner: 10Fabfur) [15:22:46] 10ops-eqiad, 06Data-Platform-SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11017307 (10RobH) [15:23:26] 10ops-eqiad, 06Data-Platform-SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - 
https://phabricator.wikimedia.org/T399964#11017308 (10RobH) a:03BTullis Please update the site.pp file with the insetup role for your team (detailed on https://wikitech.wikimedia.org/wiki/SRE/Dc-operations) a... [15:24:09] (03PS1) 10Federico Ceratto: zarcillo: allow egress to gerrit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170574 (https://phabricator.wikimedia.org/T389663) [15:24:10] (03CR) 10Federico Ceratto: "A small addition in egress regarding gerrit" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170574 (https://phabricator.wikimedia.org/T389663) (owner: 10Federico Ceratto) [15:25:16] !log hashar@deploy1003 Started deploy [integration/docroot@6384514]: build: Updating mediawiki/mediawiki-phan-config to 0.16.0 [15:25:29] !log hashar@deploy1003 Finished deploy [integration/docroot@6384514]: build: Updating mediawiki/mediawiki-phan-config to 0.16.0 (duration: 00m 12s) [15:28:01] (03CR) 10BCornwall: [C:03+1] wmnet: Update s3-master alias [dns] - 10https://gerrit.wikimedia.org/r/1170555 (https://phabricator.wikimedia.org/T399954) (owner: 10Gerrit maintenance bot) [15:29:21] (03CR) 10Fabfur: "if anyone would like to try the commands I used to test it were:" [puppet] - 10https://gerrit.wikimedia.org/r/1170572 (https://phabricator.wikimedia.org/T399941) (owner: 10Fabfur) [15:31:05] (03CR) 10Fabfur: [C:04-2] haproxy: this commit deliberately contains a syntax error in haproxy [puppet] - 10https://gerrit.wikimedia.org/r/1170567 (owner: 10Fabfur) [15:32:30] (03CR) 10BCornwall: [C:03+1] "verified the uid and key are correct" [puppet] - 10https://gerrit.wikimedia.org/r/1170571 (https://phabricator.wikimedia.org/T399899) (owner: 10Ssingh) [15:35:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr1-magru and Ufinet (187.108.235.25) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [15:36:10] FIRING: BFDdown: BFD session down between cr1-eqiad and 
fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:36:32] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for resquito - https://phabricator.wikimedia.org/T399899#11017327 (10ssingh) [15:39:06] (03PS3) 10Fabfur: haproxy: script to perform configuration validation [puppet] - 10https://gerrit.wikimedia.org/r/1170572 (https://phabricator.wikimedia.org/T399941) [15:39:32] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users group (LDAP and kerberos), for aprum - https://phabricator.wikimedia.org/T398650#11017329 (10ssingh) @aranyap: Also please note that it seems like you are using the same key for WMCS and production: ` aranyap uses the same SSH key(... [15:39:37] (03CR) 10Ssingh: [C:03+2] admin: add resquito to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1170571 (https://phabricator.wikimedia.org/T399899) (owner: 10Ssingh) [15:41:10] RESOLVED: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:41:29] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for resquito - https://phabricator.wikimedia.org/T399899#11017333 (10ssingh) @REsquito-WMF: Your access request has been merged. Please allow ~30 minutes for it to roll out. I have also added you to the `wmf` n... 
[15:42:42] (03CR) 10Vgutierrez: [C:04-1] haproxy: script to perform configuration validation (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1170572 (https://phabricator.wikimedia.org/T399941) (owner: 10Fabfur) [15:48:45] (03PS14) 10Federico Ceratto: Add parsercache pooling/depooling cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1165546 (https://phabricator.wikimedia.org/T388389) [15:53:13] (03CR) 10BCornwall: [C:03+2] ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1165187 (owner: 10Ncmonitor) [15:55:10] FIRING: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:55:18] (03CR) 10CI reject: [V:04-1] Add parsercache pooling/depooling cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1165546 (https://phabricator.wikimedia.org/T388389) (owner: 10Federico Ceratto) [15:55:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T399249)', diff saved to https://phabricator.wikimedia.org/P79407 and previous config saved to /var/cache/conftool/dbconfig/20250718-155542-marostegui.json [15:55:47] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [16:00:10] RESOLVED: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [16:01:04] (03CR) 10Dzahn: [C:03+2] zuul::main: install apparmor-utils, needed for docker [puppet] - 10https://gerrit.wikimedia.org/r/1170444 
(https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [16:01:11] (03PS2) 10Dzahn: zuul::main: install apparmor-utils, needed for docker [puppet] - 10https://gerrit.wikimedia.org/r/1170444 (https://phabricator.wikimedia.org/T395938) [16:07:14] (03CR) 10Dzahn: [C:03+2] zuul::main: install apparmor-utils, needed for docker [puppet] - 10https://gerrit.wikimedia.org/r/1170444 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [16:09:58] (03PS4) 10Fabfur: haproxy: script to perform configuration validation [puppet] - 10https://gerrit.wikimedia.org/r/1170572 (https://phabricator.wikimedia.org/T399941) [16:10:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P79408 and previous config saved to /var/cache/conftool/dbconfig/20250718-161050-marostegui.json [16:10:55] (03CR) 10Fabfur: haproxy: script to perform configuration validation (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1170572 (https://phabricator.wikimedia.org/T399941) (owner: 10Fabfur) [16:16:58] (03CR) 10Scott French: [V:03+2] "Thanks for the review!" 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1170405 (https://phabricator.wikimedia.org/T378128) (owner: 10Scott French) [16:17:28] (03CR) 10Scott French: [V:03+2 C:03+2] httpd: clean up transitional -bookworm track [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1170405 (https://phabricator.wikimedia.org/T378128) (owner: 10Scott French) [16:21:56] (03PS15) 10Federico Ceratto: Add parsercache pooling/depooling cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1165546 (https://phabricator.wikimedia.org/T388389) [16:25:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P79409 and previous config saved to /var/cache/conftool/dbconfig/20250718-162557-marostegui.json [16:29:04] (03CR) 10CI reject: [V:04-1] Add parsercache pooling/depooling cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1165546 (https://phabricator.wikimedia.org/T388389) (owner: 10Federico Ceratto) [16:33:10] FIRING: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [16:34:25] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-esams:xe-0/1/2 (inter.link reserved port) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:38:10] FIRING: [2x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [16:41:05] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T399249)', diff saved to https://phabricator.wikimedia.org/P79410 and previous config saved to /var/cache/conftool/dbconfig/20250718-164105-marostegui.json [16:41:10] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [16:41:21] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2156.codfw.wmnet with reason: Maintenance [16:41:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2156 (T399249)', diff saved to https://phabricator.wikimedia.org/P79411 and previous config saved to /var/cache/conftool/dbconfig/20250718-164128-marostegui.json [16:43:10] RESOLVED: [2x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [16:53:27] (03PS5) 10Fabfur: haproxy: script to perform configuration validation [puppet] - 10https://gerrit.wikimedia.org/r/1170572 (https://phabricator.wikimedia.org/T399941) [16:54:40] FIRING: [3x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [16:55:09] !log fceratto@cumin1002 START - Cookbook sre.mysql.parsercache [16:55:10] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [16:55:16] !log fceratto@cumin1002 START - Cookbook sre.mysql.parsercache [16:55:16] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [16:55:45] (03CR) 10Fabfur: haproxy: script to perform configuration validation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1170572 (https://phabricator.wikimedia.org/T399941) (owner: 10Fabfur) [16:55:56] !log fceratto@cumin1002 START - Cookbook sre.mysql.parsercache [16:55:57] !log 
fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [16:56:50] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170584 [16:58:55] RESOLVED: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:11:03] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#11017530 (10Jclark-ctr) Power cycled ml-server1012 [17:30:10] FIRING: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:33:30] 06SRE, 06Infrastructure-Foundations: Redundant bootloaders for software RAID - https://phabricator.wikimedia.org/T215183#11017567 (10Eevans) >>! In T215183#11014363, @Eevans wrote: > Has there been any progress toward goal #2? I didn't see where anything had been added to the mentioned runbook. > > For conte... 
[17:34:11] (03PS16) 10Federico Ceratto: Add parsercache pooling/depooling cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1165546 (https://phabricator.wikimedia.org/T388389) [17:35:10] RESOLVED: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:35:40] (03PS17) 10Federico Ceratto: Add parsercache pooling/depooling cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1165546 (https://phabricator.wikimedia.org/T388389) [17:47:40] FIRING: [2x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:47:55] RESOLVED: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:58:55] FIRING: [2x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:03:55] RESOLVED: [2x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:39:04] FIRING: PuppetConstantChange: Puppet 
performing a change on every puppet run on wdqs1022:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [18:41:33] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users group (LDAP and kerberos), for aprum - https://phabricator.wikimedia.org/T398650#11017710 (10aranyap) @ssingh I'm now able to access JupyterHub and have deleted the WMCS key. Thank you! [18:44:10] FIRING: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:47:34] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users group (LDAP and kerberos), for aprum - https://phabricator.wikimedia.org/T398650#11017724 (10ssingh) 05Open→03Resolved Thanks for resolving the WMC key issue! 
[18:49:10] RESOLVED: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:54:13] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:55:40] FIRING: [2x] BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:56:33] (03CR) 10Dzahn: [V:03+1] gerrit: replace host names in replica config with variables (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1170433 (https://phabricator.wikimedia.org/T387833) (owner: 10Dzahn) [18:58:09] (03CR) 10Dzahn: [V:03+1] gerrit: replace host names in replica config with variables (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1170433 (https://phabricator.wikimedia.org/T387833) (owner: 10Dzahn) [18:59:52] (03PS2) 10Dzahn: aphlict: create system user with systemd:sysuser [puppet] - 10https://gerrit.wikimedia.org/r/1080823 (https://phabricator.wikimedia.org/T377374) [18:59:58] (03PS3) 10Dzahn: aphlict: create system user with systemd:sysuser [puppet] - 10https://gerrit.wikimedia.org/r/1080823 (https://phabricator.wikimedia.org/T377374) [19:00:30] (03CR) 10Dzahn: "@mmuhlenhoff@wikimedia.org not -1 anymore now, since we are on bookworm. right?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1080823 (https://phabricator.wikimedia.org/T377374) (owner: 10Dzahn) [19:00:40] RESOLVED: [2x] BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [19:04:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T399249)', diff saved to https://phabricator.wikimedia.org/P79413 and previous config saved to /var/cache/conftool/dbconfig/20250718-190416-marostegui.json [19:04:22] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [19:06:30] 06SRE, 06Fundraising-Backlog, 10fundraising-tech-ops, 10Infrastructure Security, and 2 others: Re-opening our DMarcian Trial - https://phabricator.wikimedia.org/T394788#11017745 (10nisrael) Hi SRE team, Checking in on this task. Do you have an approximate timeline when we'd be able to configure the mail r... 
[19:09:25] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [19:19:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P79414 and previous config saved to /var/cache/conftool/dbconfig/20250718-191924-marostegui.json [19:34:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P79415 and previous config saved to /var/cache/conftool/dbconfig/20250718-193431-marostegui.json [19:39:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [19:49:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T399249)', diff saved to https://phabricator.wikimedia.org/P79416 and previous config saved to /var/cache/conftool/dbconfig/20250718-194938-marostegui.json [19:49:43] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [19:49:44] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2177.codfw.wmnet with reason: Maintenance [19:49:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2177 (T399249)', diff saved to https://phabricator.wikimedia.org/P79417 and previous config saved to /var/cache/conftool/dbconfig/20250718-194951-marostegui.json [20:13:37] !log gmodena@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [20:14:03] !log gmodena@deploy1003 helmfile [staging] DONE 
helmfile.d/services/mw-page-content-change-enrich: apply [20:24:10] FIRING: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [20:29:10] RESOLVED: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [20:32:54] 06SRE, 10SRE-Access-Requests: Access Request to DMarcDigests - https://phabricator.wikimedia.org/T399976#11017861 (10Johannnes89) [20:34:25] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-esams:xe-0/1/2 (inter.link reserved port) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [20:38:08] 06SRE, 10SRE-Access-Requests, 06Infrastructure-Foundations, 10Mail: Access Request to DMarcDigests - https://phabricator.wikimedia.org/T399976#11017873 (10Dzahn) [20:40:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [20:41:40] FIRING: [2x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - 
https://alerts.wikimedia.org/?q=alertname%3DBFDdown [20:46:40] RESOLVED: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [21:09:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [21:10:10] FIRING: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [21:15:10] FIRING: [2x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [21:20:10] FIRING: [3x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [21:25:10] RESOLVED: [3x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [21:25:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [21:28:24] !log gmodena@deploy1003 helmfile [staging] START 
helmfile.d/services/mw-page-content-change-enrich: apply [21:28:28] !log gmodena@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [21:31:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [21:49:12] !log gmodena@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [21:49:18] !log gmodena@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [21:57:03] !log gmodena@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [21:57:08] !log gmodena@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [21:57:09] 06SRE, 10Wikimedia-Mailing-lists: Archive affiliates-l - https://phabricator.wikimedia.org/T399878#11017983 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup [22:11:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T399249)', diff saved to https://phabricator.wikimedia.org/P79418 and previous config saved to /var/cache/conftool/dbconfig/20250718-221112-marostegui.json [22:11:17] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [22:24:10] FIRING: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [22:26:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P79419 and previous config saved to 
/var/cache/conftool/dbconfig/20250718-222620-marostegui.json [22:29:10] RESOLVED: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [22:39:04] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs1022:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [22:41:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P79420 and previous config saved to /var/cache/conftool/dbconfig/20250718-224127-marostegui.json [22:54:13] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [22:56:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T399249)', diff saved to https://phabricator.wikimedia.org/P79421 and previous config saved to /var/cache/conftool/dbconfig/20250718-225635-marostegui.json [22:56:40] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [22:56:51] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2190.codfw.wmnet with reason: Maintenance [22:56:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2190 (T399249)', diff saved to https://phabricator.wikimedia.org/P79422 and previous config saved to /var/cache/conftool/dbconfig/20250718-225658-marostegui.json [23:03:40] FIRING: [2x] BFDdown: BFD session down between 
cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [23:06:18] (03PS1) 10Krinkle: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170614 [23:07:07] (03CR) 10CI reject: [V:04-1] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170614 (owner: 10Krinkle) [23:08:01] (03CR) 10Krinkle: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138933 (owner: 10Zabe) [23:08:40] RESOLVED: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [23:09:26] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [23:38:31] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1170615 [23:38:31] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1170615 (owner: 10TrainBranchBot) [23:39:10] FIRING: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [23:44:10] RESOLVED: BFDdown: BFD session down 
between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [23:48:08] (03PS2) 10Krinkle: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170614 [23:48:08] (03PS1) 10Krinkle: build: Fix failing `phpcs` in CI on commits updating interwiki.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170616 [23:50:40] FIRING: [3x] BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [23:51:32] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1170615 (owner: 10TrainBranchBot) [23:55:40] RESOLVED: [4x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [23:55:55] FIRING: [4x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown