[00:05:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2225', diff saved to https://phabricator.wikimedia.org/P79274 and previous config saved to /var/cache/conftool/dbconfig/20250717-000537-marostegui.json [00:07:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [00:08:03] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1170219 [00:08:04] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1170219 (owner: 10TrainBranchBot) [00:09:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [00:14:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [00:14:57] FIRING: [2x] JobUnavailable: Reduced availability for job gerrit-replica in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:20:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2225 (T399249)', diff saved to https://phabricator.wikimedia.org/P79275 and previous config saved to /var/cache/conftool/dbconfig/20250717-002045-marostegui.json [00:20:49] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on 
db2226.codfw.wmnet with reason: Maintenance [00:20:49] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [00:20:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2226 (T399249)', diff saved to https://phabricator.wikimedia.org/P79276 and previous config saved to /var/cache/conftool/dbconfig/20250717-002056-marostegui.json [00:26:10] FIRING: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [00:26:17] (03PS5) 10Tryvix1509: Create "abusefilter" editor user group for Vietnamese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169603 (https://phabricator.wikimedia.org/T399535) [00:27:29] (03CR) 10Tryvix1509: Create "abusefilter" editor user group for Vietnamese Wikipedia (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169603 (https://phabricator.wikimedia.org/T399535) (owner: 10Tryvix1509) [00:29:20] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1170219 (owner: 10TrainBranchBot) [00:31:10] RESOLVED: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [00:58:10] FIRING: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown 
[01:01:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2226 (T399249)', diff saved to https://phabricator.wikimedia.org/P79277 and previous config saved to /var/cache/conftool/dbconfig/20250717-010111-marostegui.json [01:01:16] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [01:03:10] FIRING: [2x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [01:03:48] 10SRE-swift-storage, 10MinT, 10LPL Essential (2025 Jul-Sep), 10LPL Projects (MinT for Wikireaders – FY26 WE 3.1.5): Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491#11011645 (10KartikMistry) @Dzahn Yes. We can remove MinT models from our home directori... [01:08:10] RESOLVED: [3x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [01:16:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2226', diff saved to https://phabricator.wikimedia.org/P79278 and previous config saved to /var/cache/conftool/dbconfig/20250717-011619-marostegui.json [01:31:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2226', diff saved to https://phabricator.wikimedia.org/P79279 and previous config saved to /var/cache/conftool/dbconfig/20250717-013127-marostegui.json [01:46:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2226 (T399249)', diff saved to https://phabricator.wikimedia.org/P79280 and previous config saved to /var/cache/conftool/dbconfig/20250717-014635-marostegui.json [01:46:40] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [01:46:51] !log 
marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2238.codfw.wmnet with reason: Maintenance [01:46:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2238 (T399249)', diff saved to https://phabricator.wikimedia.org/P79281 and previous config saved to /var/cache/conftool/dbconfig/20250717-014658-marostegui.json [01:51:10] FIRING: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [01:56:10] RESOLVED: [2x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [02:04:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [02:21:07] PROBLEM - Disk space on dbprov2003 is CRITICAL: DISK CRITICAL - free space: /srv 391035MiB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=dbprov2003&var-datasource=codfw+prometheus/ops [02:25:10] FIRING: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [02:30:10] RESOLVED: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [02:39:04] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs1022:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [02:50:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2238 (T399249)', diff saved to https://phabricator.wikimedia.org/P79282 and previous config saved to /var/cache/conftool/dbconfig/20250717-025002-marostegui.json [02:50:08] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [02:52:10] FIRING: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [02:54:25] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-esams:xe-0/1/2 (inter.link reserved port) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [02:57:10] RESOLVED: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [03:05:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2238', diff saved 
to https://phabricator.wikimedia.org/P79283 and previous config saved to /var/cache/conftool/dbconfig/20250717-030511-marostegui.json [03:09:25] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [03:14:21] (03CR) 10Krinkle: "Yeah, I've gone ahead and swapped the 2021 patch version for PS8 on deployment prep. Unlike the 2021 version, which fixed beta by breaking" [puppet] - 10https://gerrit.wikimedia.org/r/941479 (https://phabricator.wikimedia.org/T357877) (owner: 10Krinkle) [03:20:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2238', diff saved to https://phabricator.wikimedia.org/P79284 and previous config saved to /var/cache/conftool/dbconfig/20250717-032020-marostegui.json [03:21:07] RECOVERY - Disk space on dbprov2003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=dbprov2003&var-datasource=codfw+prometheus/ops [03:22:13] (03PS12) 10Krinkle: beta: redirect misc *.beta.wmflabs.org to *.beta.wmcloud.org [puppet] - 10https://gerrit.wikimedia.org/r/1170188 (https://phabricator.wikimedia.org/T289318) [03:24:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [03:25:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [03:30:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - 
https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [03:31:10] FIRING: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [03:32:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [03:35:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2238 (T399249)', diff saved to https://phabricator.wikimedia.org/P79285 and previous config saved to /var/cache/conftool/dbconfig/20250717-033528-marostegui.json [03:35:32] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [03:36:10] RESOLVED: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [03:42:40] FIRING: [3x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [03:47:40] RESOLVED: [3x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [03:51:45] 06SRE, 06Traffic: "Backend fetch 
failed" on edit save - https://phabricator.wikimedia.org/T382790#11011741 (BCornwall) Hi, @MGChecker! I apologize for the long delay in getting back to you on this. Would you say that this is still an issue since you opened the task? [03:54:52] SRE, Commons, MediaWiki-Uploading, Traffic: HTTP 503 error when uploading images on Wikimedia Commons - https://phabricator.wikimedia.org/T383274#11011752 (BCornwall) @Underbar_dk Since some time has passed, have you observed similar difficulties with various sites on IPv6? Or would you say that... [03:57:10] SRE, Commons, Traffic: Backend fetch failed - https://phabricator.wikimedia.org/T383013#11011754 (BCornwall) Hi, @Jeff_G! I apologize for the delay in getting to you on this - Would you say this was a transient issue or a persistent one? [04:13:40] (CR) Dragoniez: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/1169603 (https://phabricator.wikimedia.org/T399535) (owner: Tryvix1509) [04:14:57] FIRING: [2x] JobUnavailable: Reduced availability for job gerrit-replica in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:23:10] FIRING: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [04:26:46] (PS6) Tryvix1509: Create "abusefilter" editor user group for Vietnamese Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1169603 (https://phabricator.wikimedia.org/T399535) [04:27:39] (CR) Tryvix1509: "Could you please re-review again, sorry but I didn't update commit message." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169603 (https://phabricator.wikimedia.org/T399535) (owner: 10Tryvix1509) [04:28:10] RESOLVED: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [05:06:10] FIRING: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [05:09:42] FIRING: [3x] JobUnavailable: Reduced availability for job gerrit-replica in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:11:10] RESOLVED: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [05:11:48] FIRING: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [05:17:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [05:19:42] FIRING: [3x] JobUnavailable: Reduced 
availability for job gerrit-replica in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:19:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [05:21:29] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, July 17 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169603 (https://phabricator.wikimedia.org/T399535) (owner: 10Tryvix1509) [05:21:48] RESOLVED: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [05:40:10] FIRING: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [05:45:10] RESOLVED: [2x] BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [05:50:01] (03PS2) 10Effie Mouzeli: prometheus::ops add job to scrape hCaptcha proxy metrics [puppet] - 10https://gerrit.wikimedia.org/r/1170186 (https://phabricator.wikimedia.org/T399211) [05:52:55] (03CR) 10Effie Mouzeli: "Thank you for the 
review!" [puppet] - https://gerrit.wikimedia.org/r/1170186 (https://phabricator.wikimedia.org/T399211) (owner: Effie Mouzeli) [05:53:07] (CR) Effie Mouzeli: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1170186 (https://phabricator.wikimedia.org/T399211) (owner: Effie Mouzeli) [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250717T0600) [06:00:05] marostegui, Amir1, and federico3: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250717T0600). [06:02:24] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: maintenance [06:05:10] FIRING: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [06:06:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db1258 with weight 0 T399699', diff saved to https://phabricator.wikimedia.org/P79286 and previous config saved to /var/cache/conftool/dbconfig/20250717-060629-root.json [06:06:30] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 15 hosts with reason: Primary switchover x3 T399699 [06:06:34] T399699: Switchover x3 master (db1255 -> db1258) - https://phabricator.wikimedia.org/T399699 [06:07:47] (CR) Marostegui: [C:+2] mariadb: Promote db1258 to x3 master [puppet] - https://gerrit.wikimedia.org/r/1170098 (https://phabricator.wikimedia.org/T399699) (owner: Gerrit maintenance bot) [06:09:29] !log Starting x3 eqiad failover from db1255 to db1258 - T399699 [06:09:36] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:10:10] RESOLVED: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [06:12:07] (PS1) Marostegui: dbconfig.schema: Add x3 [puppet] - https://gerrit.wikimedia.org/r/1170225 (https://phabricator.wikimedia.org/T399699) [06:12:35] (CR) Marostegui: "This was breaking dbctl during the switchover" [puppet] - https://gerrit.wikimedia.org/r/1170225 (https://phabricator.wikimedia.org/T399699) (owner: Marostegui) [06:13:31] (CR) Marostegui: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1170225 (https://phabricator.wikimedia.org/T399699) (owner: Marostegui) [06:14:22] (CR) Marostegui: [C:+2] dbconfig.schema: Add x3 [puppet] - https://gerrit.wikimedia.org/r/1170225 (https://phabricator.wikimedia.org/T399699) (owner: Marostegui) [06:15:13] (CR) Ryan Kemper: Replace elasticsearch api with python requests (2 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/1167299 (https://phabricator.wikimedia.org/T390860) (owner: Ryan Kemper) [06:17:04] (CR) Effie Mouzeli: [C:+2] prometheus::ops add job to scrape hCaptcha proxy metrics [puppet] - https://gerrit.wikimedia.org/r/1170186 (https://phabricator.wikimedia.org/T399211) (owner: Effie Mouzeli) [06:18:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set x3 eqiad as read-only for maintenance - T399699', diff saved to https://phabricator.wikimedia.org/P79287 and previous config saved to /var/cache/conftool/dbconfig/20250717-061800-root.json [06:18:05] T399699: Switchover x3 master (db1255 -> db1258) - https://phabricator.wikimedia.org/T399699 [06:18:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db1258 to x3 primary and 
set section read-write T399699', diff saved to https://phabricator.wikimedia.org/P79288 and previous config saved to /var/cache/conftool/dbconfig/20250717-061832-marostegui.json [06:19:05] (CR) Marostegui: [C:+2] wmnet: Update x3-master alias [dns] - https://gerrit.wikimedia.org/r/1170099 (https://phabricator.wikimedia.org/T399699) (owner: Gerrit maintenance bot) [06:19:08] !log marostegui@dns1006 START - running authdns-update [06:19:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1255 T399699', diff saved to https://phabricator.wikimedia.org/P79289 and previous config saved to /var/cache/conftool/dbconfig/20250717-061943-marostegui.json [06:20:02] !log marostegui@dns1006 END - running authdns-update [06:22:29] (PS1) Marostegui: db1211: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1170227 (https://phabricator.wikimedia.org/T399298) [06:24:16] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on 10 hosts with reason: Maintenance [06:24:43] (CR) Marostegui: [C:+2] db1211: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1170227 (https://phabricator.wikimedia.org/T399298) (owner: Marostegui) [06:25:16] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db1255.eqiad.wmnet with reason: Maintenance [06:26:02] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1211.eqiad.wmnet with reason: Maintenance [06:27:49] (PS1) Marostegui: db1255: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1170228 (https://phabricator.wikimedia.org/T399298) [06:28:16] (CR) Marostegui: [C:+2] db1255: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1170228 (https://phabricator.wikimedia.org/T399298) (owner: Marostegui) [06:29:43] !log marostegui@cumin1002 START - Cookbook sre.mysql.parsercache [06:29:44] !log 
marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [06:30:13] !log marostegui@cumin1002 START - Cookbook sre.mysql.parsercache [06:30:26] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [06:33:20] !log jelto@cumin1003 START - Cookbook sre.hosts.reboot-single for host gitlab1004.wikimedia.org [06:33:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1211 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P79291 and previous config saved to /var/cache/conftool/dbconfig/20250717-063327-root.json [06:34:10] FIRING: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [06:34:21] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2204.codfw.wmnet with reason: Maintenance [06:34:42] FIRING: [3x] JobUnavailable: Reduced availability for job gerrit-replica in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:35:51] (03CR) 10Tiziano Fogli: [C:03+1] raid: Do not use the pipe symbol '|' as a separator for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/1168176 (https://phabricator.wikimedia.org/T395446) (owner: 10Jcrespo) [06:39:04] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs1022:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [06:39:10] RESOLVED: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [06:39:16] !log jelto@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab1004.wikimedia.org [06:41:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [06:48:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1211 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P79292 and previous config saved to /var/cache/conftool/dbconfig/20250717-064833-root.json [06:48:52] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, July 17 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1168757 (https://phabricator.wikimedia.org/T348513) (owner: 10Abijeet Patro) [06:51:29] (03PS1) 10Marostegui: db2205: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1170229 (https://phabricator.wikimedia.org/T399548) [06:54:18] (03CR) 10Jcrespo: [C:03+2] raid: Do not use the pipe symbol '|' as a separator for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/1168176 (https://phabricator.wikimedia.org/T395446) (owner: 10Jcrespo) [06:54:25] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-esams:xe-0/1/2 (inter.link reserved port) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [06:54:42] FIRING: [4x] JobUnavailable: 
Reduced availability for job gerrit-replica in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:57:33] (CR) Elukey: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6296/co" [puppet] - https://gerrit.wikimedia.org/r/1166345 (https://phabricator.wikimedia.org/T394301) (owner: Bartosz Wójtowicz) [06:57:48] (CR) Elukey: [V:+1 C:+2] statistics: Add Python script for model uploading to statistics machines. [puppet] - https://gerrit.wikimedia.org/r/1166345 (https://phabricator.wikimedia.org/T394301) (owner: Bartosz Wójtowicz) [07:00:05] Amir1, Urbanecm, and awight: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250717T0700). [07:00:05] georgekyz, Hide_on_rosie, and abijeet: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[07:00:16] (03CR) 10Marostegui: [C:03+2] db2205: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1170229 (https://phabricator.wikimedia.org/T399548) (owner: 10Marostegui) [07:00:55] Hey folks, I am going to start the deployment right now [07:01:09] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2205.codfw.wmnet with reason: Maintenance [07:01:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2205 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P79293 and previous config saved to /var/cache/conftool/dbconfig/20250717-070112-marostegui.json [07:01:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by gkyziridis@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170092 (https://phabricator.wikimedia.org/T395668) (owner: 10Gkyziridis) [07:02:00] Me too [07:02:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1255 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P79294 and previous config saved to /var/cache/conftool/dbconfig/20250717-070211-root.json [07:02:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:02:36] (03Merged) 10jenkins-bot: ores-extension: enable revertrisk filter for simplewiki and trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170092 (https://phabricator.wikimedia.org/T395668) (owner: 10Gkyziridis) [07:03:06] !log gkyziridis@deploy1003 Started scap sync-world: Backport for [[gerrit:1170092|ores-extension: enable revertrisk filter for simplewiki and trwiki (T395668)]] [07:03:10] T395668: [batch #1] Enable revertrisk filters in simplewiki & trwiki - 
https://phabricator.wikimedia.org/T395668 [07:03:16] (03PS1) 10Marostegui: db1175: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1170232 (https://phabricator.wikimedia.org/T399548) [07:03:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P79295 and previous config saved to /var/cache/conftool/dbconfig/20250717-070338-root.json [07:04:44] o/ [07:05:19] (03CR) 10Marostegui: [C:03+2] db1175: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1170232 (https://phabricator.wikimedia.org/T399548) (owner: 10Marostegui) [07:05:41] !log gkyziridis@deploy1003 gkyziridis: Backport for [[gerrit:1170092|ores-extension: enable revertrisk filter for simplewiki and trwiki (T395668)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [07:06:06] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1175.eqiad.wmnet with reason: Maintenance [07:06:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1175 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P79296 and previous config saved to /var/cache/conftool/dbconfig/20250717-070609-marostegui.json [07:07:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:09:19] !log gkyziridis@deploy1003 gkyziridis: Continuing with sync [07:09:25] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [07:12:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2205 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P79297 and previous config saved to /var/cache/conftool/dbconfig/20250717-071201-root.json [07:13:03] 06SRE, 06Commons, 10MediaWiki-Uploading, 06Traffic: HTTP 503 error when uploading images on Wikimedia Commons - https://phabricator.wikimedia.org/T383274#11011982 (10Underbar_dk) I have not seen similar problems in other sites, but I have not had the opportunity to test with Commons either, unfortunately. [07:15:23] (03CR) 10Wangombe: [C:03+1] CX: Remove unused config related to database and cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1168757 (https://phabricator.wikimedia.org/T348513) (owner: 10Abijeet Patro) [07:16:31] !log gkyziridis@deploy1003 Finished scap sync-world: Backport for [[gerrit:1170092|ores-extension: enable revertrisk filter for simplewiki and trwiki (T395668)]] (duration: 13m 25s) [07:16:35] T395668: [batch #1] Enable revertrisk filters in simplewiki & trwiki - https://phabricator.wikimedia.org/T395668 [07:16:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1175 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P79298 and previous config saved to /var/cache/conftool/dbconfig/20250717-071642-root.json [07:17:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1255 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P79299 and previous config saved to /var/cache/conftool/dbconfig/20250717-071717-root.json [07:18:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P79300 and previous config saved to /var/cache/conftool/dbconfig/20250717-071844-root.json [07:19:53] (03CR) 10Dreamrimmer: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169603 
(https://phabricator.wikimedia.org/T399535) (owner: 10Tryvix1509) [07:20:01] !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: sync [07:20:07] folks I am finished with my deployment. Feel free to proceed. Thnx [07:20:30] !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: sync [07:21:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [07:22:15] (03PS1) 10Elukey: Revert^2 "services: configure tegola in codfw to use maps-test" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170234 [07:24:34] (03CR) 10Elukey: [C:03+2] Revert^2 "services: configure tegola in codfw to use maps-test" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170234 (owner: 10Elukey) [07:26:46] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for olliekryva - https://phabricator.wikimedia.org/T399803 (10OKryva-WMF) 03NEW [07:27:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2205 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P79301 and previous config saved to /var/cache/conftool/dbconfig/20250717-072709-root.json [07:28:05] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: sync [07:28:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:30:20] georgekyz, would you have time to help deploy my change? [07:30:39] yeap sure [07:30:43] thanks! 
[07:30:52] Here's the patch: 1168757: CX: Remove unused config related to database and cluster | https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1168757 [07:31:14] (03PS1) 10Elukey: services: set user tegola for Tegola's codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170235 (https://phabricator.wikimedia.org/T381565) [07:31:45] I see this as the next deployment: Hide on Rosie (Hide_on_rosie) [07:31:45] [config] 1169603 (Deploy change) Create "abusefilter" editor user group for Vietnamese Wikipedia - task T399535 [07:31:45] T399535: Create "abusefilter" user group for Vietnamese Wikipedia (vi.wikipedia.org) - https://phabricator.wikimedia.org/T399535 [07:31:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1175 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P79302 and previous config saved to /var/cache/conftool/dbconfig/20250717-073147-root.json [07:32:04] https://www.irccloud.com/pastebin/QUVqbFuf/ [07:32:05] (03CR) 10Tacsipacsi: php8.1-cli: introduce opcache and JIT (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1113124 (https://phabricator.wikimedia.org/T384294) (owner: 10Effie Mouzeli) [07:32:11] Hello, I'm here [07:32:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1255 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P79303 and previous config saved to /var/cache/conftool/dbconfig/20250717-073223-root.json [07:32:41] Hide_on_rosie: do you want to proceed with yours? 
[07:32:52] and then I can help @abijeet [07:33:02] yes, thanks [07:33:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:34:42] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [07:34:59] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [07:35:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1156 (T399249)', diff saved to https://phabricator.wikimedia.org/P79304 and previous config saved to /var/cache/conftool/dbconfig/20250717-073506-marostegui.json [07:35:08] abijeet: I'm also around if you need help. [07:35:10] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [07:35:40] (03CR) 10Elukey: [C:03+2] services: set user tegola for Tegola's codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170235 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [07:37:10] FIRING: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [07:38:11] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: sync [07:38:14] kart_: georgekyz: hello, can you help with my change? 
[07:38:30] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: sync [07:38:49] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: sync [07:39:07] I was looking the patch from @abijeet [07:39:51] georgekyz: go ahead. I'm on bad network and shouldn't be deploy. [07:40:21] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for olliekryva - https://phabricator.wikimedia.org/T399803#11012023 (10SCherukuwada) I approve of this request. [07:41:13] alright I will go first with @abijeet patch: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1168757 [07:41:20] is anybody around as well? [07:42:10] FIRING: [2x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [07:42:11] hey [07:42:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2205 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P79305 and previous config saved to /var/cache/conftool/dbconfig/20250717-074214-root.json [07:42:54] georgekyz, Here's the patch: 1168757: CX: Remove unused config related to database and cluster | https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1168757 [07:43:16] I starting deployment of that one, I was checking the patch [07:43:28] it seems ok, lets see [07:43:29] starting now [07:43:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by gkyziridis@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1168757 (https://phabricator.wikimedia.org/T348513) (owner: 10Abijeet Patro) [07:44:37] (03Merged) 10jenkins-bot: CX: Remove unused config related to database and cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1168757 (https://phabricator.wikimedia.org/T348513) (owner: 10Abijeet Patro) [07:44:40] abijeet: so the only thing you are doing is to remove the configs for translation cluster 
and the database ? [07:44:44] right ? [07:44:59] georgekyz, yup, we should just check that CX still functions after this [07:45:00] !log gkyziridis@deploy1003 Started scap sync-world: Backport for [[gerrit:1168757|CX: Remove unused config related to database and cluster (T348513)]] [07:45:12] T348513: Migrate ContentTranslation to use a virtual database domain - https://phabricator.wikimedia.org/T348513 [07:45:19] alright stay around for testing it [07:45:24] ok [07:45:35] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for olliekryva - https://phabricator.wikimedia.org/T399803#11012027 (10OKryva-WMF) 05Open→03Invalid [07:45:36] the deployment started [07:45:57] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for olliekryva - https://phabricator.wikimedia.org/T399803#11012028 (10OKryva-WMF) Requested through https://idm.wikimedia.org/permissions/ instead. [07:46:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1175 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P79306 and previous config saved to /var/cache/conftool/dbconfig/20250717-074653-root.json [07:47:10] RESOLVED: [2x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [07:47:19] !log gkyziridis@deploy1003 gkyziridis, abi: Backport for [[gerrit:1168757|CX: Remove unused config related to database and cluster (T348513)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[07:47:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1255 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P79307 and previous config saved to /var/cache/conftool/dbconfig/20250717-074728-root.json [07:48:25] abijeet: now is the time to test it [07:48:39] I am not clicking sync [07:48:47] georgekyz: how long does it take [07:48:55] georgekyz, ok, on it [07:49:54] Hide_on_rosie: when @abijeet finish testing it will take around 5 mins. If something is going wrong then we need to revert it and deploy the reverted version which means more time. [07:49:59] (03PS1) 10Effie Mouzeli: prometheus::ops update nginx-exporter port [puppet] - 10https://gerrit.wikimedia.org/r/1170245 [07:50:06] thanks [07:50:20] (03PS2) 10Effie Mouzeli: prometheus::ops update nginx-exporter port [puppet] - 10https://gerrit.wikimedia.org/r/1170245 [07:50:27] Hide_on_rosie: your patch seems to be kinda bigger and I need first to review it, I cannot take responsibility to deploy something without review it [07:51:27] sure, go ahead :) [07:54:13] abijeet: how are you testing this? [07:54:26] georgekyz, need 1 more minute [07:54:40] no worries just asking take your time [07:55:09] georgekyz, i think we are good [07:55:26] how can I test it as well? [07:57:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2205 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P79308 and previous config saved to /var/cache/conftool/dbconfig/20250717-075720-root.json [07:57:48] georgekyz, go to Special:ContentTranslation and start translating an article, you can try publishing it to your namespace [07:58:08] (on any wikipedia) [08:00:05] dancy and andre: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250717T0800). [08:00:17] abijeet: it seems that it is working. 
I can just click on the text and see the automated translation in the right [08:00:33] yup [08:00:33] abijeet: are we good to go? Click Sync? Do you need extra testing ? [08:01:39] georgekyz, yup we can sync [08:01:47] alrighty ! [08:01:51] !log gkyziridis@deploy1003 gkyziridis, abi: Continuing with sync [08:02:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1175 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P79309 and previous config saved to /var/cache/conftool/dbconfig/20250717-080159-root.json [08:02:40] FIRING: [4x] BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [08:02:55] RESOLVED: [4x] BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [08:05:26] (03CR) 10Effie Mouzeli: [C:03+2] prometheus::ops update nginx-exporter port [puppet] - 10https://gerrit.wikimedia.org/r/1170245 (owner: 10Effie Mouzeli) [08:07:15] !log gkyziridis@deploy1003 Finished scap sync-world: Backport for [[gerrit:1168757|CX: Remove unused config related to database and cluster (T348513)]] (duration: 22m 15s) [08:07:20] T348513: Migrate ContentTranslation to use a virtual database domain - https://phabricator.wikimedia.org/T348513 [08:07:38] Patch: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1168757 [08:07:44] deployed successfully! [08:08:13] congrats @abijeet [08:08:30] georgekyz, thank you. I'll do another sanity check [08:08:36] yes please [08:09:04] abijeet: if you see something going wrong please create a revert patch and schedule it for deployment asap [08:09:22] let me know if everything is fine please :P [08:11:11] Hi, are you all done [08:11:18] georgekyz, looks ok. [08:12:13] abijeet: thnx a lot for sharing! congrats! 
[08:12:58] Hide_on_rosie: we are finished with @abijeet patch [08:13:25] okay, what I have to do now [08:13:34] my WikimediaDebug is ready [08:14:41] Hide_on_rosie: the deployment window has already come to an end. I did not have the time to review your patch yet :( [08:14:42] FIRING: [2x] JobUnavailable: Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:15:09] :( [08:15:21] Hide_on_rosie: it would be good to reschedule it for the next time window, and find someone to review it [08:15:26] and deploy it [08:15:36] okay [08:17:20] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, July 17 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169603 (https://phabricator.wikimedia.org/T399535) (owner: 10Tryvix1509) [08:19:42] RESOLVED: [2x] JobUnavailable: Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:25:32] (03PS1) 10Vgutierrez: pyrra: Limit istio latency SLI queries to a single app [puppet] - 10https://gerrit.wikimedia.org/r/1170271 (https://phabricator.wikimedia.org/T398534) [08:25:47] georgekyz: may I ask [08:25:51] what is your timezone [08:26:30] UTC+3, right now time is 11:26 in the morning [08:27:40] Hide_on_rosie: I would suggest to find another deployer who will be available because I am kinda busy with other tasks and meetings today. 
I am not sure if @kart_ would be available to help you [08:28:07] okay [08:28:16] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1170271 (https://phabricator.wikimedia.org/T398534) (owner: 10Vgutierrez) [08:37:19] (03CR) 10Marostegui: [C:03+1] "Yeah, I normally just use -t." [cookbooks] - 10https://gerrit.wikimedia.org/r/1167898 (owner: 10Volans) [08:40:21] (03PS2) 10Vgutierrez: pyrra: Limit istio SLI queries to a single app [puppet] - 10https://gerrit.wikimedia.org/r/1170271 (https://phabricator.wikimedia.org/T398534) [08:42:13] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1170271 (https://phabricator.wikimedia.org/T398534) (owner: 10Vgutierrez) [08:43:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T399249)', diff saved to https://phabricator.wikimedia.org/P79310 and previous config saved to /var/cache/conftool/dbconfig/20250717-084308-marostegui.json [08:43:14] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [08:43:30] (03CR) 10Elukey: [C:03+1] "Looks great, thanks a lot!" [puppet] - 10https://gerrit.wikimedia.org/r/1170271 (https://phabricator.wikimedia.org/T398534) (owner: 10Vgutierrez) [08:46:16] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11012208 (10elukey) Adding some issues that I found when moving Tegola to the maps-test2* cluster, so I don't forget: - For some reason the tegola user had the wrong password set, I... 
[08:47:29] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11012262 (10elukey) [08:50:25] (03CR) 10Vgutierrez: [C:03+2] pyrra: Limit istio SLI queries to a single app [puppet] - 10https://gerrit.wikimedia.org/r/1170271 (https://phabricator.wikimedia.org/T398534) (owner: 10Vgutierrez) [08:57:22] (03PS2) 10Arnaudb: gerrit: fix scraping on gerrit-replica [puppet] - 10https://gerrit.wikimedia.org/r/1170275 (https://phabricator.wikimedia.org/T398854) [08:58:06] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170278 [08:58:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P79311 and previous config saved to /var/cache/conftool/dbconfig/20250717-085815-marostegui.json [08:59:19] (03CR) 10Jgiannelos: [C:03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170278 (owner: 10PipelineBot) [09:00:55] (03Merged) 10jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170278 (owner: 10PipelineBot) [09:11:42] (03PS4) 10Elukey: WIP: sre.hosts.provision: add custom settings for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1170085 (https://phabricator.wikimedia.org/T394357) [09:12:58] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [09:13:01] (03PS1) 10Brouberol: site: assign the insetup::data_platform_ferm role to dse-k8s-worker1014 [puppet] - 10https://gerrit.wikimedia.org/r/1170279 (https://phabricator.wikimedia.org/T399779) [09:13:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P79312 and previous config saved to /var/cache/conftool/dbconfig/20250717-091323-marostegui.json 
[09:13:50] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=93) for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [09:14:59] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#11012409 (10elukey) @Jclark-ctr Hi! I think that these servers don't have the calvin password set up (sigh), so I'd need the BMC passwords to test a new version of the... [09:18:39] (03PS1) 10Jakob: wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170277 (https://phabricator.wikimedia.org/T398689) [09:19:45] (03PS2) 10Brouberol: site: assign the insetup::data_platform_ferm role to dse-k8s-worker1014 [puppet] - 10https://gerrit.wikimedia.org/r/1170279 (https://phabricator.wikimedia.org/T399778) [09:19:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [09:19:52] (03CR) 10Dima koushha: [C:03+1] wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170277 (https://phabricator.wikimedia.org/T398689) (owner: 10Jakob) [09:24:13] (03CR) 10Btullis: site: assign the insetup::data_platform_ferm role to dse-k8s-worker1014 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1170279 (https://phabricator.wikimedia.org/T399778) (owner: 10Brouberol) [09:24:23] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [09:24:53] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [09:27:13] (03CR) 10Brouberol: site: assign the insetup::data_platform_ferm role to 
dse-k8s-worker1014 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1170279 (https://phabricator.wikimedia.org/T399778) (owner: 10Brouberol) [09:27:17] (03PS3) 10Brouberol: site: assign the insetup::data_platform_ferm role to dse-k8s-worker1014 [puppet] - 10https://gerrit.wikimedia.org/r/1170279 (https://phabricator.wikimedia.org/T399778) [09:28:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T399249)', diff saved to https://phabricator.wikimedia.org/P79313 and previous config saved to /var/cache/conftool/dbconfig/20250717-092831-marostegui.json [09:28:37] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [09:28:46] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [09:28:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1162 (T399249)', diff saved to https://phabricator.wikimedia.org/P79314 and previous config saved to /var/cache/conftool/dbconfig/20250717-092854-marostegui.json [09:32:05] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221#11012447 (10cmooney) Arelion came back to say they no longer see CRC errrors on their side: ` Please note we are not detecting errors in our interface on Dallas e... 
[09:32:15] (03CR) 10Jakob: [C:03+2] "deploying now" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170277 (https://phabricator.wikimedia.org/T398689) (owner: 10Jakob) [09:33:53] (03Merged) 10jenkins-bot: wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170277 (https://phabricator.wikimedia.org/T398689) (owner: 10Jakob) [09:34:39] !log jakob@deploy1003 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply [09:34:54] !log jakob@deploy1003 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply [09:35:22] !log jakob@deploy1003 helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply [09:35:41] !log jakob@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply [09:36:02] !log jakob@deploy1003 helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply [09:36:18] !log jakob@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply [09:40:14] 10ops-codfw, 06SRE, 06DC-Ops: Arelion IC-374549 100G Transport outage (cr1-codfw -> cr1-eqiad) July 2025 - https://phabricator.wikimedia.org/T399097#11012461 (10cmooney) Arelion have delcared the situation is resolved: ` 7/17/2025 9:00:40 AM Cause of Outage: This incident initially originated under a separ... 
[09:44:01] (03PS1) 10Vgutierrez: acme_chief: Remove certs older than 1 year [puppet] - 10https://gerrit.wikimedia.org/r/1170281 (https://phabricator.wikimedia.org/T399419) [09:44:42] (03CR) 10CI reject: [V:04-1] acme_chief: Remove certs older than 1 year [puppet] - 10https://gerrit.wikimedia.org/r/1170281 (https://phabricator.wikimedia.org/T399419) (owner: 10Vgutierrez) [09:46:01] (03PS1) 10Tiziano Fogli: prometheus::pop: manage pop Prometheus instances centrally [puppet] - 10https://gerrit.wikimedia.org/r/1170282 (https://phabricator.wikimedia.org/T397003) [09:46:26] (03CR) 10CI reject: [V:04-1] prometheus::pop: manage pop Prometheus instances centrally [puppet] - 10https://gerrit.wikimedia.org/r/1170282 (https://phabricator.wikimedia.org/T397003) (owner: 10Tiziano Fogli) [09:49:25] (03PS2) 10Tiziano Fogli: prometheus::pop: manage pop Prometheus instances centrally [puppet] - 10https://gerrit.wikimedia.org/r/1170282 (https://phabricator.wikimedia.org/T397003) [09:50:20] (03PS6) 10Stang: zhwiki: Allow local securepoll setup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100228 (https://phabricator.wikimedia.org/T380020) [09:50:28] (03PS2) 10Vgutierrez: acme_chief: Remove certs older than 1 year [puppet] - 10https://gerrit.wikimedia.org/r/1170281 (https://phabricator.wikimedia.org/T399419) [09:50:46] (03CR) 10Tiziano Fogli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1170282 (https://phabricator.wikimedia.org/T397003) (owner: 10Tiziano Fogli) [09:51:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:51:36] (03CR) 10Stang: zhwiki: Allow local securepoll setup (032 comments) [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1100228 (https://phabricator.wikimedia.org/T380020) (owner: 10Stang) [09:52:01] (03CR) 10Stang: "Resolved" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100228 (https://phabricator.wikimedia.org/T380020) (owner: 10Stang) [09:52:43] (03PS7) 10Stang: zhwiki: Allow local securepoll setup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100228 (https://phabricator.wikimedia.org/T380020) [09:53:10] FIRING: GanetiBGPDown: BGP session down between ganeti2034 and lsw1-a4-codfw - group Ganeti6 - https://wikitech.wikimedia.org/wiki/Ganeti#GanetiBGPDown - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=lsw1-a4-codfw:9804&var-bgp_group=Ganeti6&var-bgp_neighbor=ganeti2034 - https://alerts.wikimedia.org/?q=alertname%3DGanetiBGPDown [09:53:22] (03PS8) 10Stang: zhwiki: Allow local securepoll setup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100228 (https://phabricator.wikimedia.org/T380020) [09:56:17] (03CR) 10Arnaudb: [C:03+2] gerrit: fix scraping on gerrit-replica [puppet] - 10https://gerrit.wikimedia.org/r/1170275 (https://phabricator.wikimedia.org/T398854) (owner: 10Arnaudb) [09:56:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:57:44] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1170281 (https://phabricator.wikimedia.org/T399419) (owner: 10Vgutierrez) [09:57:54] (03CR) 10Filippo Giunchedi: [C:03+1] prometheus::pop: manage pop Prometheus instances centrally [puppet] - 10https://gerrit.wikimedia.org/r/1170282 (https://phabricator.wikimedia.org/T397003) (owner: 10Tiziano 
Fogli) [09:58:10] RESOLVED: GanetiBGPDown: BGP session down between ganeti2034 and lsw1-a4-codfw - group Ganeti6 - https://wikitech.wikimedia.org/wiki/Ganeti#GanetiBGPDown - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=lsw1-a4-codfw:9804&var-bgp_group=Ganeti6&var-bgp_neighbor=ganeti2034 - https://alerts.wikimedia.org/?q=alertname%3DGanetiBGPDown [09:58:10] (03CR) 10Tiziano Fogli: [C:03+2] prometheus::pop: manage pop Prometheus instances centrally [puppet] - 10https://gerrit.wikimedia.org/r/1170282 (https://phabricator.wikimedia.org/T397003) (owner: 10Tiziano Fogli) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250717T1000) [10:09:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [10:11:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T399249)', diff saved to https://phabricator.wikimedia.org/P79315 and previous config saved to /var/cache/conftool/dbconfig/20250717-101156-marostegui.json [10:12:02] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [10:13:34] (03PS2) 10Ayounsi: Ganeti Bird BGP [puppet] - 10https://gerrit.wikimedia.org/r/1169662 (https://phabricator.wikimedia.org/T362392) [10:14:13] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [10:14:53] !log cmooney@cumin1003 START - Cookbook sre.network.peering with action 'configure' for AS: 30182 [10:15:55] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART 
[10:16:15] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [10:16:15] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 30182 [10:16:22] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [10:18:04] (03PS5) 10Elukey: WIP: sre.hosts.provision: add custom settings for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1170085 (https://phabricator.wikimedia.org/T394357) [10:18:49] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [10:20:26] (03PS1) 10Tiziano Fogli: prom/metamonitor: simplify PQL query to retrieve instance list [puppet] - 10https://gerrit.wikimedia.org/r/1170286 (https://phabricator.wikimedia.org/T397003) [10:23:16] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [10:23:30] (03CR) 10Filippo Giunchedi: [C:03+1] "Nice!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1170286 (https://phabricator.wikimedia.org/T397003) (owner: 10Tiziano Fogli) [10:23:38] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [10:23:58] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [10:24:03] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [10:24:46] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [10:24:52] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1169662 (https://phabricator.wikimedia.org/T362392) (owner: 10Ayounsi) [10:25:01] (03CR) 10Ayounsi: Ganeti Bird BGP (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1169662 (https://phabricator.wikimedia.org/T362392) (owner: 10Ayounsi) [10:25:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [10:26:41] (03CR) 10Tiziano Fogli: [C:03+2] prom/metamonitor: simplify PQL query to retrieve instance list [puppet] - 10https://gerrit.wikimedia.org/r/1170286 (https://phabricator.wikimedia.org/T397003) (owner: 10Tiziano Fogli) [10:27:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P79316 and previous config saved to /var/cache/conftool/dbconfig/20250717-102704-marostegui.json [10:27:35] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply [10:28:55] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [10:32:00] (03PS1) 10FNegri: admin: migrate fnegri to sk-ssh-ed25519 key [puppet] - 10https://gerrit.wikimedia.org/r/1170287 [10:32:45] 
(03PS2) 10FNegri: admin: migrate fnegri to sk-ssh-ed25519 key [puppet] - 10https://gerrit.wikimedia.org/r/1170287 [10:39:04] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs1022:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [10:40:35] (03PS1) 10Brouberol: Kafka: expose a KafkaAdminClient builder method [software/spicerack] - 10https://gerrit.wikimedia.org/r/1170289 (https://phabricator.wikimedia.org/T399069) [10:40:50] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 10Data-Platform-SRE (2025.07.05 - 2025.07.25), 13Patch-For-Review: Proposal: adding a kafka admin client to spicerack - https://phabricator.wikimedia.org/T399069#11012714 (10brouberol) [10:40:51] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 10Data-Platform-SRE (2025.07.05 - 2025.07.25), 13Patch-For-Review: Proposal: adding a kafka admin client to spicerack - https://phabricator.wikimedia.org/T399069#11012715 (10brouberol) 05Open→03In progress [10:42:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P79317 and previous config saved to /var/cache/conftool/dbconfig/20250717-104211-marostegui.json [10:42:24] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 10Data-Platform-SRE (2025.07.05 - 2025.07.25), 13Patch-For-Review: Proposal: adding a kafka admin client to spicerack - https://phabricator.wikimedia.org/T399069#11012730 (10brouberol) a:03brouberol [10:47:25] (03PS6) 10Elukey: WIP: sre.hosts.provision: add custom settings for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1170085 (https://phabricator.wikimedia.org/T394357) [10:48:25] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [10:48:35] !log elukey@cumin1003 END (FAIL) - 
Cookbook sre.hosts.provision (exit_code=99) for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [10:48:57] !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply [10:49:03] !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply [10:49:08] (03CR) 10CI reject: [V:04-1] Kafka: expose a KafkaAdminClient builder method [software/spicerack] - 10https://gerrit.wikimedia.org/r/1170289 (https://phabricator.wikimedia.org/T399069) (owner: 10Brouberol) [10:49:19] (03PS7) 10Elukey: WIP: sre.hosts.provision: add custom settings for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1170085 (https://phabricator.wikimedia.org/T394357) [10:49:41] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [10:49:45] !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply [10:50:19] !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply [10:51:01] !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/mw-experimental: apply [10:52:07] PROBLEM - SSH on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:52:09] !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply [10:52:17] (03CR) 10Ayounsi: "PCC shows some `neighbor external;` I *think* that it's because of PCC and it would be fine in prod, but to be double checked." 
[puppet] - 10https://gerrit.wikimedia.org/r/1169662 (https://phabricator.wikimedia.org/T362392) (owner: 10Ayounsi) [10:52:49] PROBLEM - Squid on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/HTTP_proxy [10:52:59] RECOVERY - SSH on install1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:54:08] FIRING: [4x] ProbeDown: Ripe Atlas anchor atlas1001:80 is not returning HTTP 200 OK on port 80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:54:25] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-esams:xe-0/1/2 (inter.link reserved port) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [10:55:15] (03PS1) 10Btullis: Tweak the java options for hive-metastore on the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/1170293 (https://phabricator.wikimedia.org/T399711) [10:55:17] (03PS1) 10Btullis: Apply the hive-metastore GC changes to production [puppet] - 10https://gerrit.wikimedia.org/r/1170294 (https://phabricator.wikimedia.org/T399711) [10:55:41] (03PS2) 10Btullis: Tweak the java options for hive-metastore on the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/1170293 (https://phabricator.wikimedia.org/T399711) [10:55:56] (03PS2) 10Btullis: Apply the hive-metastore GC changes to production [puppet] - 10https://gerrit.wikimedia.org/r/1170294 (https://phabricator.wikimedia.org/T399711) [10:56:00] (03CR) 10CI reject: [V:04-1] Apply the hive-metastore GC changes to production [puppet] - 10https://gerrit.wikimedia.org/r/1170294 (https://phabricator.wikimedia.org/T399711) (owner: 
10Btullis) [10:56:07] PROBLEM - SSH on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:57:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T399249)', diff saved to https://phabricator.wikimedia.org/P79318 and previous config saved to /var/cache/conftool/dbconfig/20250717-105719-marostegui.json [10:57:23] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [10:57:34] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [10:57:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1182 (T399249)', diff saved to https://phabricator.wikimedia.org/P79319 and previous config saved to /var/cache/conftool/dbconfig/20250717-105741-marostegui.json [11:00:13] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.provision (exit_code=97) for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [11:02:46] (03PS1) 10Marostegui: db1166: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1170297 (https://phabricator.wikimedia.org/T399548) [11:03:34] (03CR) 10Marostegui: [C:03+2] db1166: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1170297 (https://phabricator.wikimedia.org/T399548) (owner: 10Marostegui) [11:03:39] RECOVERY - Squid on install1004 is OK: TCP OK - 0.007 second response time on 208.80.154.74 port 8080 https://wikitech.wikimedia.org/wiki/HTTP_proxy [11:03:59] RECOVERY - SSH on install1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:04:02] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1166.eqiad.wmnet with reason: Maintenance [11:04:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1166 for migration to mariadb 
10.11', diff saved to https://phabricator.wikimedia.org/P79320 and previous config saved to /var/cache/conftool/dbconfig/20250717-110405-marostegui.json [11:04:30] (03CR) 10Cathal Mooney: [C:03+1] "LGTM! Ping me if you've any issues" [puppet] - 10https://gerrit.wikimedia.org/r/1170287 (owner: 10FNegri) [11:05:19] (03PS1) 10Marostegui: db2227: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1170300 (https://phabricator.wikimedia.org/T399548) [11:06:00] (03CR) 10Stevemunene: [C:03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/1170293 (https://phabricator.wikimedia.org/T399711) (owner: 10Btullis) [11:06:43] (03CR) 10Stevemunene: [C:03+1] Apply the hive-metastore GC changes to production [puppet] - 10https://gerrit.wikimedia.org/r/1170294 (https://phabricator.wikimedia.org/T399711) (owner: 10Btullis) [11:08:42] !log elukey@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM install1004.wikimedia.org [11:09:01] (03CR) 10Jelto: [C:03+1] gerrit: fix scraping on gerrit-replica [puppet] - 10https://gerrit.wikimedia.org/r/1170275 (https://phabricator.wikimedia.org/T398854) (owner: 10Arnaudb) [11:09:08] RESOLVED: [2x] ProbeDown: Ripe Atlas anchor atlas1001:80 is not returning HTTP 200 OK on port 80 - https://wikitech.wikimedia.org/wiki/RIPE_Atlas#HTTP_checks_failing - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:09:25] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [11:10:17] (03CR) 10Marostegui: [C:03+2] db2227: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1170300 (https://phabricator.wikimedia.org/T399548) (owner: 10Marostegui) [11:11:28] !log marostegui@cumin1002 DONE (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 1:00:00 on db2227.codfw.wmnet with reason: Maintenance [11:11:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2227 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P79321 and previous config saved to /var/cache/conftool/dbconfig/20250717-111132-marostegui.json [11:13:27] RECOVERY - MegaRAID on backup1007 is OK: OK: optimal, 1 logical, 24 physical https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:14:27] !log marostegui@cumin1002 START - Cookbook sre.mysql.parsercache [11:14:36] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [11:14:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1166 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P79323 and previous config saved to /var/cache/conftool/dbconfig/20250717-111454-root.json [11:15:15] !log elukey@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM install1004.wikimedia.org [11:16:21] (03CR) 10Marostegui: "I don't find any explanation for this: https://phabricator.wikimedia.org/P79324" [cookbooks] - 10https://gerrit.wikimedia.org/r/1165546 (https://phabricator.wikimedia.org/T388389) (owner: 10Federico Ceratto) [11:17:22] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance [11:17:36] !log Restart pc4 T399540 [11:17:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:42] T399540: Upgrade masters to 10.6.22 and 10.11.13 .2 update - https://phabricator.wikimedia.org/T399540 [11:19:39] (03PS1) 10Stevemunene: hdfs: Add an-worker 1176|1179|1186 to analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/1170301 (https://phabricator.wikimedia.org/T398027) [11:22:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2227 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P79325 and 
previous config saved to /var/cache/conftool/dbconfig/20250717-112220-root.json [11:22:27] !log stevemunene@cumin1003 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1176.eqiad.wmnet [11:23:05] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1011.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:23:31] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1014.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:24:18] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [11:24:18] !log stevemunene@cumin1003 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1176.eqiad.wmnet [11:24:36] !log stevemunene@cumin1003 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1179.eqiad.wmnet [11:24:37] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [11:25:51] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host ml-serve1012.eqiad.wmnet with OS bookworm [11:26:53] !log stevemunene@cumin1003 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1179.eqiad.wmnet [11:27:05] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:27:13] !log stevemunene@cumin1003 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1186.eqiad.wmnet [11:28:31] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:29:17] 
FIRING: [3x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:30:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1166 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P79326 and previous config saved to /var/cache/conftool/dbconfig/20250717-113000-root.json [11:30:14] !log stevemunene@cumin1003 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1186.eqiad.wmnet [11:33:10] FIRING: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [11:34:17] RESOLVED: [6x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:37:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2227 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P79327 and previous config saved to /var/cache/conftool/dbconfig/20250717-113726-root.json [11:38:10] RESOLVED: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [11:38:58] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ml-serve1012.eqiad.wmnet with OS bookworm [11:41:07] 
!log marostegui@cumin1002 START - Cookbook sre.mysql.parsercache [11:41:21] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [11:41:40] (03CR) 10Marostegui: "Never mind this, I was using the wrong order!. All works fine" [cookbooks] - 10https://gerrit.wikimedia.org/r/1165546 (https://phabricator.wikimedia.org/T388389) (owner: 10Federico Ceratto) [11:42:06] (03CR) 10Marostegui: "This is still something we should try to improve" [cookbooks] - 10https://gerrit.wikimedia.org/r/1165546 (https://phabricator.wikimedia.org/T388389) (owner: 10Federico Ceratto) [11:42:12] (03PS1) 10Arthur taylor: Enable wbui2025 mobile user interface on Wikidata Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170304 (https://phabricator.wikimedia.org/T399703) [11:43:41] (03PS2) 10Arthur taylor: Enable wbui2025 mobile user interface on Wikidata Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170304 (https://phabricator.wikimedia.org/T399703) [11:44:55] FIRING: [2x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [11:45:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1166 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P79329 and previous config saved to /var/cache/conftool/dbconfig/20250717-114506-root.json [11:45:40] RESOLVED: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [11:50:24] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.07.05 - 2025.07.25): Degraded RAID on an-worker1175 - https://phabricator.wikimedia.org/T399355#11012932 (10Jclark-ctr) 05Open→03Resolved Replaced Failed 
Drive Thanks for the assistance with this @BTullis [11:52:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2227 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P79330 and previous config saved to /var/cache/conftool/dbconfig/20250717-115232-root.json [11:59:55] FIRING: [2x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250717T1200) [12:00:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1166 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P79332 and previous config saved to /var/cache/conftool/dbconfig/20250717-120014-root.json [12:02:40] RESOLVED: [2x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:04:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T399249)', diff saved to https://phabricator.wikimedia.org/P79333 and previous config saved to /var/cache/conftool/dbconfig/20250717-120444-marostegui.json [12:04:49] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [12:05:24] !log gmodena@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [12:05:36] !log gmodena@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [12:07:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2227 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P79334 and previous config saved to /var/cache/conftool/dbconfig/20250717-120738-root.json [12:10:29] (03CR) 10Effie Mouzeli: [C:03+1] "LGTM! 
Thanks" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1165153 (https://phabricator.wikimedia.org/T398246) (owner: 10Scott French) [12:13:17] RECOVERY - MinIO server processes on backup1007 is OK: PROCS OK: 1 process with command name minio, args server https://wikitech.wikimedia.org/wiki/Media_storage/Backups [12:18:54] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on backup1007 - https://phabricator.wikimedia.org/T399671#11013011 (10jcrespo) I told @Jclark-ctr not to replace the 13th disk yet, as I was more worried about the jbod ones than the RAID: ` root@backup1007:~$ megacli -PDList -aall | grep rro Media Error Count: 0 O... [12:19:13] (03CR) 10Btullis: [C:03+1] hdfs: Add an-worker 1176|1179|1186 to analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/1170301 (https://phabricator.wikimedia.org/T398027) (owner: 10Stevemunene) [12:19:38] (03CR) 10Btullis: [C:03+2] Tweak the java options for hive-metastore on the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/1170293 (https://phabricator.wikimedia.org/T399711) (owner: 10Btullis) [12:19:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P79335 and previous config saved to /var/cache/conftool/dbconfig/20250717-121952-marostegui.json [12:20:54] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on backup1007 - https://phabricator.wikimedia.org/T399671#11013013 (10jcrespo) Note my prediction is that we will need 3 new disks, not only 1 to be replaced (but this can be resolve for now). 
[12:21:40] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on backup1007 - https://phabricator.wikimedia.org/T399671#11013021 (10Jclark-ctr) 05Open→03Resolved Updated Firmware on idrac while logged in thanks for assistance @jcrespo [12:23:16] (03PS1) 10Jforrester: PendingChangesPager: Stop using ANSI-89 joins [extensions/FlaggedRevs] (wmf/1.45.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1170318 (https://phabricator.wikimedia.org/T399641) [12:26:54] (03CR) 10Jcrespo: [C:03+2] "Icinga now says: "communication: 0 OK : controller: 0 OK : physical_disk: 0 OK : virtual_disk: 0 OK : bbu: 0 OK : enclosure: 0 OK"" [puppet] - 10https://gerrit.wikimedia.org/r/1168176 (https://phabricator.wikimedia.org/T395446) (owner: 10Jcrespo) [12:28:25] (03CR) 10FNegri: [C:03+2] admin: migrate fnegri to sk-ssh-ed25519 key [puppet] - 10https://gerrit.wikimedia.org/r/1170287 (owner: 10FNegri) [12:30:12] (03CR) 10Stevemunene: [C:03+2] hdfs: Add an-worker 1176|1179|1186 to analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/1170301 (https://phabricator.wikimedia.org/T398027) (owner: 10Stevemunene) [12:30:14] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170326 [12:35:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P79336 and previous config saved to /var/cache/conftool/dbconfig/20250717-123459-marostegui.json [12:35:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [12:36:36] (03CR) 10Btullis: [C:03+2] Apply the hive-metastore GC changes to production [puppet] - 10https://gerrit.wikimedia.org/r/1170294 (https://phabricator.wikimedia.org/T399711) (owner: 10Btullis) [12:36:45] FIRING: WidespreadPuppetFailure: Puppet has 
failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [12:41:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [12:43:21] (03PS2) 10Brouberol: Kafka: expose a KafkaAdminClient builder method [software/spicerack] - 10https://gerrit.wikimedia.org/r/1170289 (https://phabricator.wikimedia.org/T399069) [12:50:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T399249)', diff saved to https://phabricator.wikimedia.org/P79337 and previous config saved to /var/cache/conftool/dbconfig/20250717-125007-marostegui.json [12:50:13] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [12:50:22] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1188.eqiad.wmnet with reason: Maintenance [12:50:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1188 (T399249)', diff saved to https://phabricator.wikimedia.org/P79338 and previous config saved to /var/cache/conftool/dbconfig/20250717-125029-marostegui.json [12:51:45] (03CR) 10CI reject: [V:04-1] Kafka: expose a KafkaAdminClient builder method [software/spicerack] - 10https://gerrit.wikimedia.org/r/1170289 (https://phabricator.wikimedia.org/T399069) (owner: 10Brouberol) [12:53:44] (03PS3) 10Brouberol: Kafka: expose a KafkaAdminClient builder method [software/spicerack] - 10https://gerrit.wikimedia.org/r/1170289 (https://phabricator.wikimedia.org/T399069) [12:54:43] (03PS1) 10Btullis: "Fail over hive services to an-coord1004" [dns] - 10https://gerrit.wikimedia.org/r/1170331 [12:58:15] FIRING: WidespreadPuppetFailure: 
Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [12:59:20] (03CR) 10Btullis: [C:03+2] "Fail over hive services to an-coord1004" [dns] - 10https://gerrit.wikimedia.org/r/1170331 (owner: 10Btullis) [12:59:31] !log btullis@dns1004 START - running authdns-update [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250717T1300). [13:00:05] joelyrookewmde and Hide_on_rosie: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:18] \o/ [13:00:26] !log btullis@dns1004 END - running authdns-update [13:00:51] I can probably deploy in 15 minutes or so but not yet :) [13:01:05] oh no :( [13:03:39] (03CR) 10CI reject: [V:04-1] Kafka: expose a KafkaAdminClient builder method [software/spicerack] - 10https://gerrit.wikimedia.org/r/1170289 (https://phabricator.wikimedia.org/T399069) (owner: 10Brouberol) [13:04:24] (03CR) 10Brouberol: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1170289 (https://phabricator.wikimedia.org/T399069) (owner: 10Brouberol) [13:09:10] FIRING: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:09:33] o/ [13:09:35] now I can deploy ^^ [13:09:45] hi [13:10:18] we are here for the T388685 [13:10:19] T388685: Show labels for properties and items on Wikipedia watchlist summaries - 
https://phabricator.wikimedia.org/T388685 [13:11:06] and I'm here for T399535 [13:11:07] T399535: Create "abusefilter" user group for Vietnamese Wikipedia (vi.wikipedia.org) - https://phabricator.wikimedia.org/T399535 [13:11:20] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] Activate feature to resolve changelist wikibase link labels in all wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169077 (https://phabricator.wikimedia.org/T388685) (owner: 10Joely Rooke WMDE) [13:11:34] whoa that’s a lot of “PHP Deprecated” in logspam-watch [13:11:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169077 (https://phabricator.wikimedia.org/T388685) (owner: 10Joely Rooke WMDE) [13:12:25] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] Activate feature to resolve changelist wikibase link labels in all wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169077 (https://phabricator.wikimedia.org/T388685) (owner: 10Joely Rooke WMDE) [13:12:48] (03Merged) 10jenkins-bot: Activate feature to resolve changelist wikibase link labels in all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169077 (https://phabricator.wikimedia.org/T388685) (owner: 10Joely Rooke WMDE) [13:13:12] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1169077|Activate feature to resolve changelist wikibase link labels in all wikis (T388685)]] [13:14:10] RESOLVED: [2x] BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:15:26] !log lucaswerkmeister-wmde@deploy1003 joelyrookewmde, lucaswerkmeister-wmde: Backport for [[gerrit:1169077|Activate feature to 
resolve changelist wikibase link labels in all wikis (T388685)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:15:30] T388685: Show labels for properties and items on Wikipedia watchlist summaries - https://phabricator.wikimedia.org/T388685 [13:16:02] joelyrookewmde, suzannewoodWMDE6: please test :) [13:16:18] can do, but I can't see any 1001 or 1002 servers in the extension [13:16:23] which should we use for testing? [13:16:53] (03CR) 10Ssingh: [C:03+1] "Looks good but question: is there a reason you want to remove older than 365 days and a smaller interval like 6 months or something, given" [puppet] - 10https://gerrit.wikimedia.org/r/1170281 (https://phabricator.wikimedia.org/T399419) (owner: 10Vgutierrez) [13:18:20] Lucas_WMDE it' [13:18:24] it's working [13:18:51] joelyrookewmde: k8s-mwdebug is the one you should be using these days [13:18:53] (03CR) 10Ssingh: [C:03+1] "*and _not_ a shorter interval like 6 months" [puppet] - 10https://gerrit.wikimedia.org/r/1170281 (https://phabricator.wikimedia.org/T399419) (owner: 10Vgutierrez) [13:19:14] !log lucaswerkmeister-wmde@deploy1003 joelyrookewmde, lucaswerkmeister-wmde: Continuing with sync [13:19:18] (03PS1) 10Jforrester: [metawiki] Set site name to 'Meta-Wiki', not just 'Meta' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170339 (https://phabricator.wikimedia.org/T399843) [13:19:51] (03CR) 10Jforrester: [C:04-2] "Waiting for community consensus first." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170339 (https://phabricator.wikimedia.org/T399843) (owner: 10Jforrester) [13:20:11] (03PS1) 10Andrew Bogott: cloudceph osd.yaml: update nic names for 1006 [puppet] - 10https://gerrit.wikimedia.org/r/1170341 (https://phabricator.wikimedia.org/T399281) [13:20:54] (03CR) 10David Caro: [C:03+1] cloudceph osd.yaml: update nic names for 1006 [puppet] - 10https://gerrit.wikimedia.org/r/1170341 (https://phabricator.wikimedia.org/T399281) (owner: 10Andrew Bogott) [13:24:10] Lucas_WMDE Thanks for helping us with the deployment! [13:24:16] np :) [13:24:43] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1169077|Activate feature to resolve changelist wikibase link labels in all wikis (T388685)]] (duration: 11m 30s) [13:24:47] T388685: Show labels for properties and items on Wikipedia watchlist summaries - https://phabricator.wikimedia.org/T388685 [13:24:53] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "LGTM (AFAICT urbanecm’s concern was addressed)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169603 (https://phabricator.wikimedia.org/T399535) (owner: 10Tryvix1509) [13:25:18] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169603 (https://phabricator.wikimedia.org/T399535) (owner: 10Tryvix1509) [13:26:11] (03Merged) 10jenkins-bot: Create "abusefilter" editor user group for Vietnamese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169603 (https://phabricator.wikimedia.org/T399535) (owner: 10Tryvix1509) [13:26:34] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1169603|Create "abusefilter" editor user group for Vietnamese Wikipedia (T399535)]] [13:26:39] T399535: Create "abusefilter" user group for Vietnamese Wikipedia (vi.wikipedia.org) - https://phabricator.wikimedia.org/T399535 [13:27:11] Thanks Lucas_WMDE: 
[13:28:35] (03PS1) 10David Caro: prometheus-node-pinger: fix the script to return 1 on failure [puppet] - 10https://gerrit.wikimedia.org/r/1170342 (https://phabricator.wikimedia.org/T399281) [13:28:44] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, tryvix1509: Backport for [[gerrit:1169603|Create "abusefilter" editor user group for Vietnamese Wikipedia (T399535)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:28:57] (03CR) 10Vgutierrez: "no good reason besides erring on the cautious side of things" [puppet] - 10https://gerrit.wikimedia.org/r/1170281 (https://phabricator.wikimedia.org/T399419) (owner: 10Vgutierrez) [13:31:07] Hide_on_rosie: please test :) [13:32:46] (03CR) 10Andrew Bogott: [C:03+1] prometheus-node-pinger: fix the script to return 1 on failure [puppet] - 10https://gerrit.wikimedia.org/r/1170342 (https://phabricator.wikimedia.org/T399281) (owner: 10David Caro) [13:33:53] https://vi.wikipedia.org/w/index.php?title=%C4%90%E1%BA%B7c_bi%E1%BB%87t:Quy%E1%BB%81n_nh%C3%B3m_ng%C6%B0%E1%BB%9Di_d%C3%B9ng&uselang=vi looks good to me FWIW (the abusefilter group gets four rights: changetags, managechangetags, abusefilter-modify, oathauth-enable [13:33:56] ) [13:34:02] seems ok [13:34:08] https://usercontent.irccloud-cdn.com/file/Esn3BO1r/image.png [13:34:09] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, tryvix1509: Continuing with sync [13:34:11] ok! [13:35:09] (03CR) 10Ssingh: [C:03+1] "OK, I guess 365 days is definitely a start, so +1." 
[puppet] - 10https://gerrit.wikimedia.org/r/1170281 (https://phabricator.wikimedia.org/T399419) (owner: 10Vgutierrez) [13:35:17] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170345 [13:35:32] (03CR) 10Kosta Harlan: [C:03+1] Enable hCaptcha on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1168178 (https://phabricator.wikimedia.org/T382148) (owner: 10Dreamy Jazz) [13:35:40] https://usercontent.irccloud-cdn.com/file/NwZsydTQ/image.png [13:36:43] (03CR) 10David Caro: "Tested:" [puppet] - 10https://gerrit.wikimedia.org/r/1170342 (https://phabricator.wikimedia.org/T399281) (owner: 10David Caro) [13:36:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T399249)', diff saved to https://phabricator.wikimedia.org/P79340 and previous config saved to /var/cache/conftool/dbconfig/20250717-133641-marostegui.json [13:36:45] (03CR) 10David Caro: [C:03+2] prometheus-node-pinger: fix the script to return 1 on failure [puppet] - 10https://gerrit.wikimedia.org/r/1170342 (https://phabricator.wikimedia.org/T399281) (owner: 10David Caro) [13:36:48] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [13:37:17] (03CR) 10Andrew Bogott: [C:03+2] cloudceph osd.yaml: update nic names for 1006 [puppet] - 10https://gerrit.wikimedia.org/r/1170341 (https://phabricator.wikimedia.org/T399281) (owner: 10Andrew Bogott) [13:38:42] Lucas_WMDE: Since this is my first commit to Gerrit, I would like to ask whether it syncs to the beta cluster?
[13:39:47] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1169603|Create "abusefilter" editor user group for Vietnamese Wikipedia (T399535)]] (duration: 13m 12s) [13:39:51] T399535: Create "abusefilter" user group for Vietnamese Wikipedia (vi.wikipedia.org) - https://phabricator.wikimedia.org/T399535 [13:40:54] yes, it will deploy to the beta cluster automatically [13:40:57] usually within ten minutes [13:41:11] Okay, thanks for your help [13:42:08] I can already see it at https://vi.wikipedia.beta.wmcloud.org/wiki/%C4%90%E1%BA%B7c_bi%E1%BB%87t:Quy%E1%BB%81n_nh%C3%B3m_ng%C6%B0%E1%BB%9Di_d%C3%B9ng :) [13:42:26] :oo [13:43:48] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#11013297 (10elukey) Ok so I have a provision script change that seems to work, but it doesn't touch anything on the network PXE / FixedBootOrder config (except ensuring... [13:44:41] https://usercontent.irccloud-cdn.com/file/JYWLAhSY/IMG_4264.PNG [13:44:46] (03PS1) 10Andrew Bogott: cloudceph osd.yaml: update nic names for 1006 again [puppet] - 10https://gerrit.wikimedia.org/r/1170346 (https://phabricator.wikimedia.org/T399281) [13:44:55] Lucas_WMDE: Why does it have only 3 rights [13:45:01] on beta cluster [13:45:56] (03CR) 10Andrew Bogott: [C:03+2] cloudceph osd.yaml: update nic names for 1006 again [puppet] - 10https://gerrit.wikimedia.org/r/1170346 (https://phabricator.wikimedia.org/T399281) (owner: 10Andrew Bogott) [13:46:10] FIRING: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:47:28] hmm [13:47:34] that’s a fair question Hide_on_rosie [13:47:43] I guess the magic code adding the oathauth 
stuff isn’t active on beta? [13:48:18] which sounds unfortunate because it’s definitely still useful on beta (cf. T396061) [13:48:19] T396061: Groups requiring 2FA via $wgOATHRequiredForGroups do not clearly warn users without 2FA that their permissions were truncated - https://phabricator.wikimedia.org/T396061 [13:48:23] * Lucas_WMDE looks a bit [13:49:13] hmm [13:49:38] aha [13:49:44] on beta, *everyone* has the oathauth-enable right [13:49:56] therefore there’s no need to give it to the $wmgPriviligedGroups in addition to that [13:50:12] you can see it in the Thành viên thông thường (user) group at https://vi.wikipedia.beta.wmcloud.org/wiki/%C4%90%E1%BA%B7c_bi%E1%BB%87t:Quy%E1%BB%81n_nh%C3%B3m_ng%C6%B0%E1%BB%9Di_d%C3%B9ng [13:50:23] oh, nice [13:50:44] https://gerrit.wikimedia.org/g/operations/mediawiki-config/+/92d68cca33cec238d9577899f35f21045628c835/wmf-config/CommonSettings.php#4024 is the code that reassigns the oathauth-enable right from user to privileged groups on production [13:51:10] FIRING: [2x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:51:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P79341 and previous config saved to /var/cache/conftool/dbconfig/20250717-135150-marostegui.json [13:53:28] PROBLEM - MegaRAID on backup1007 is CRITICAL: CRITICAL: 1 failed LD(s) (Partially Degraded) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [13:53:30] ACKNOWLEDGEMENT - MegaRAID on backup1007 is CRITICAL: CRITICAL: 1 failed LD(s) (Partially Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T399847 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [13:53:42] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on backup1007 - https://phabricator.wikimedia.org/T399847 
(10ops-monitoring-bot) 03NEW [13:56:10] RESOLVED: [2x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:06:16] (03PS1) 10Btullis: Revert ""Fail over hive services to an-coord1004"" [dns] - 10https://gerrit.wikimedia.org/r/1170351 [14:06:40] FIRING: [3x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:06:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P79342 and previous config saved to /var/cache/conftool/dbconfig/20250717-140658-marostegui.json [14:07:55] RESOLVED: [3x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:09:42] jouncebot: nowandnext [14:09:43] No deployments scheduled for the next 0 hour(s) and 20 minute(s) [14:09:43] In 0 hour(s) and 20 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250717T1430) [14:10:19] (03PS1) 10Kosta Harlan: Prevent submissions of forms using hCaptcha until ready [extensions/ConfirmEdit] (wmf/1.45.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1170352 (https://phabricator.wikimedia.org/T395619) [14:22:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T399249)', diff saved to https://phabricator.wikimedia.org/P79343 and previous config saved to /var/cache/conftool/dbconfig/20250717-142205-marostegui.json [14:22:10] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [14:22:21] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 6:00:00 on db1197.eqiad.wmnet with reason: Maintenance [14:22:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1197 (T399249)', diff saved to https://phabricator.wikimedia.org/P79344 and previous config saved to /var/cache/conftool/dbconfig/20250717-142228-marostegui.json [14:30:04] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250717T1430) [14:31:19] (03CR) 10Vgutierrez: [C:03+2] acme_chief: Remove certs older than 1 year [puppet] - 10https://gerrit.wikimedia.org/r/1170281 (https://phabricator.wikimedia.org/T399419) (owner: 10Vgutierrez) [14:34:36] (03PS1) 10Tiziano Fogli: prom/metamonitor: hide DeadManSwitch alerts in Karma [puppet] - 10https://gerrit.wikimedia.org/r/1170360 (https://phabricator.wikimedia.org/T397003) [14:35:49] (03CR) 10Scott French: "Thanks for the review!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170173 (https://phabricator.wikimedia.org/T378128) (owner: 10Scott French) [14:35:50] (03PS3) 10Ayounsi: Ganeti Bird BGP [puppet] - 10https://gerrit.wikimedia.org/r/1169662 (https://phabricator.wikimedia.org/T362392) [14:35:56] (03CR) 10Scott French: [C:03+2] shellbox: revert to httpd-fcgi image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170173 (https://phabricator.wikimedia.org/T378128) (owner: 10Scott French) [14:36:42] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1169662 (https://phabricator.wikimedia.org/T362392) (owner: 10Ayounsi) [14:37:36] (03PS1) 10Eevans: data-gateway-staging: use hostname (for SNI probe reqs) [puppet] - 10https://gerrit.wikimedia.org/r/1170361 (https://phabricator.wikimedia.org/T399856) [14:38:19] !log cmooney@cumin1003 START - Cookbook sre.dns.admin DNS admin: depool site eqsin [reason: depool eqsin to test backhaul cct packet loss, T399221] [14:38:22] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool site eqsin 
[reason: depool eqsin to test backhaul cct packet loss, T399221] [14:38:23] T399221: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221 [14:38:31] (03Merged) 10jenkins-bot: shellbox: revert to httpd-fcgi image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170173 (https://phabricator.wikimedia.org/T378128) (owner: 10Scott French) [14:39:04] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs1022:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [14:40:03] (03CR) 10Tiziano Fogli: "Patch ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/1170360 (https://phabricator.wikimedia.org/T397003) (owner: 10Tiziano Fogli) [14:40:29] (03CR) 10Filippo Giunchedi: [C:03+1] data-gateway-staging: use hostname (for SNI probe reqs) [puppet] - 10https://gerrit.wikimedia.org/r/1170361 (https://phabricator.wikimedia.org/T399856) (owner: 10Eevans) [14:40:34] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox: apply [14:40:49] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox: apply [14:41:20] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply [14:41:28] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [14:41:59] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-media: apply [14:42:07] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply [14:42:38] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [14:42:52] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [14:42:57] (03CR) 10Eevans: [C:03+2] data-gateway-staging: use hostname (for SNI probe reqs) [puppet] - 
10https://gerrit.wikimedia.org/r/1170361 (https://phabricator.wikimedia.org/T399856) (owner: 10Eevans) [14:43:23] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply [14:43:31] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply [14:44:03] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-video: apply [14:44:08] (03PS1) 10Stevemunene: dns: Add dse-k8s codfw urls [dns] - 10https://gerrit.wikimedia.org/r/1170364 (https://phabricator.wikimedia.org/T397293) [14:44:24] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply [14:48:29] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox: apply [14:49:07] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox: apply [14:49:38] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply [14:49:54] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply [14:50:25] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-media: apply [14:50:41] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply [14:51:12] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply [14:51:32] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [14:52:04] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply [14:52:28] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply [14:52:59] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-video: apply [14:53:21] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply [14:53:57] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is 
about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:54:25] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-esams:xe-0/1/2 (inter.link reserved port) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [14:55:10] FIRING: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:59:13] (03PS1) 10Hasan Akgün (WMDE): wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170367 [15:00:05] dancy and andre: gettimeofday() says it's time for Train log triage. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250717T1500) [15:00:10] FIRING: [2x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:01:44] !log dancy@deploy1003 Installing scap version "4.189.0" for 2 host(s) [15:03:31] !log dancy@deploy1003 Installation of scap version "4.189.0" completed for 2 hosts [15:05:10] RESOLVED: [2x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:05:40] FIRING: [2x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:07:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T399249)', diff saved to https://phabricator.wikimedia.org/P79345 and previous config saved to /var/cache/conftool/dbconfig/20250717-150659-marostegui.json [15:07:04] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [15:07:26] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170367 (owner: 10Hasan Akgün (WMDE)) [15:09:25] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - 
https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [15:10:40] RESOLVED: [3x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:12:10] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox: apply [15:12:59] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply [15:13:15] !log disable one of the 2x10G links connected to Equinix IXP Peering on cr1-codfw [15:13:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:31] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply [15:13:48] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply [15:14:19] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply [15:14:28] jouncebot: nowandnext [15:14:28] For the next 0 hour(s) and 45 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250717T1500) [15:14:28] In 0 hour(s) and 45 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250717T1600) [15:14:34] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply [15:15:03] (03CR) 10Btullis: [C:03+2] Revert ""Fail over hive services to an-coord1004"" [dns] - 10https://gerrit.wikimedia.org/r/1170351 (owner: 10Btullis) [15:15:05] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply [15:15:23] !log btullis@dns1004 START - running authdns-update [15:15:24] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [15:15:56] !log swfrench@deploy1003 helmfile [eqiad] START 
helmfile.d/services/shellbox-timeline: apply [15:16:19] !log btullis@dns1004 END - running authdns-update [15:16:21] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply [15:16:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:16:52] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [15:17:21] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [15:17:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.45.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1170352 (https://phabricator.wikimedia.org/T395619) (owner: 10Kosta Harlan) [15:19:26] (03Merged) 10jenkins-bot: Prevent submissions of forms using hCaptcha until ready [extensions/ConfirmEdit] (wmf/1.45.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1170352 (https://phabricator.wikimedia.org/T395619) (owner: 10Kosta Harlan) [15:19:48] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1170352|Prevent submissions of forms using hCaptcha until ready (T395619)]] [15:19:53] T395619: Prevent form submission until hCaptcha has run - https://phabricator.wikimedia.org/T395619 [15:20:27] (03PS8) 10Elukey: WIP: sre.hosts.provision: add custom settings for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1170085 (https://phabricator.wikimedia.org/T394357) [15:21:34] kostajh: FYI your deployment will take a long time due to the l10n files being updated. [15:21:42] Yes, I know [15:21:57] Hopefully that is OK for everyone else? 
[15:22:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P79346 and previous config saved to /var/cache/conftool/dbconfig/20250717-152207-marostegui.json [15:22:18] Yep. No problem. Sometimes it catches people by surprise so I thought I'd mention it. [15:22:24] ack, thanks [15:23:34] (03PS2) 10Lucas Werkmeister (WMDE): wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170367 (owner: 10Hasan Akgün (WMDE)) [15:24:03] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "deploying" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170367 (owner: 10Hasan Akgün (WMDE)) [15:25:47] (03Merged) 10jenkins-bot: wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170367 (owner: 10Hasan Akgün (WMDE)) [15:26:07] (03CR) 10Ssingh: "Looking good, let's add a hiera to actually get a PCC output." [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [15:27:22] jouncebot: now [15:27:22] For the next 0 hour(s) and 32 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250717T1500) [15:27:46] I’ll deploy an update to the wikidata query builder (helmfile.d stuff), shouldn’t affect train log triage or anything else I expect [15:27:47] !log lucaswerkmeister-wmde@deploy1003 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply [15:28:00] !log lucaswerkmeister-wmde@deploy1003 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply [15:28:12] !log un-drain Arelion transport circuit from codfw -> eqsin to test performance T399221 [15:28:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:16] T399221: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221 [15:28:18] (03PS5) 10Zabe: Set categorylinks to read new on most 
wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169198 (https://phabricator.wikimedia.org/T397912) [15:28:45] !log lucaswerkmeister-wmde@deploy1003 helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply [15:29:00] !log lucaswerkmeister-wmde@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply [15:29:06] !log lucaswerkmeister-wmde@deploy1003 helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply [15:29:23] !log lucaswerkmeister-wmde@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply [15:31:15] * Lucas_WMDE done deploying [15:35:23] (03PS1) 10Zabe: Set categorylinks to read new on remaining s2 large wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170371 (https://phabricator.wikimedia.org/T397912) [15:35:52] (03PS2) 10Zabe: Set categorylinks to read new on remaining large s2 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170371 (https://phabricator.wikimedia.org/T397912) [15:37:14] (03CR) 10Btullis: dns: Add dse-k8s codfw urls (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1170364 (https://phabricator.wikimedia.org/T397293) (owner: 10Stevemunene) [15:37:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P79347 and previous config saved to /var/cache/conftool/dbconfig/20250717-153715-marostegui.json [15:37:52] (03CR) 10Btullis: dns: Add dse-k8s codfw urls (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1170364 (https://phabricator.wikimedia.org/T397293) (owner: 10Stevemunene) [15:39:44] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#11013970 (10cmooney) 05Resolved→03Open @Jclark-ctr as discussed in our call on Tuesday we will be connecting the second SFP port... 
[15:42:45] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221#11013994 (10cmooney) Not sure how to progress this one. Still see zero packet loss over the link, even running for a longer period (5 mins this time): ` cmooney@... [15:44:14] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1170352|Prevent submissions of forms using hCaptcha until ready (T395619)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:44:18] T395619: Prevent form submission until hCaptcha has run - https://phabricator.wikimedia.org/T395619 [15:45:32] !log aqu@deploy1003 Started deploy [airflow-dags/analytics@9fc3ae8]: Pushing new artifacts [15:45:48] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221#11014023 (10ssingh) > Arelion want to close the ticket as they see no issue. I asked that they don't. Perhaps for now we just leave eqsin depooled and the circu... 
[15:46:13] !log aqu@deploy1003 Finished deploy [airflow-dags/analytics@9fc3ae8]: Pushing new artifacts (duration: 00m 41s) [15:48:15] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [15:50:04] !log aqu@deploy1003 Started deploy [airflow-dags/analytics_test@9fc3ae8]: Pushing new artifacts [15:50:22] !log aqu@deploy1003 Finished deploy [airflow-dags/analytics_test@9fc3ae8]: Pushing new artifacts (duration: 00m 17s) [15:51:53] !log kharlan@deploy1003 kharlan: Continuing with sync [15:52:04] (03CR) 10Samtar: ":3" [puppet] - 10https://gerrit.wikimedia.org/r/1139049 (https://phabricator.wikimedia.org/T392692) (owner: 10Samtar) [15:52:10] FIRING: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:52:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T399249)', diff saved to https://phabricator.wikimedia.org/P79348 and previous config saved to /var/cache/conftool/dbconfig/20250717-155223-marostegui.json [15:52:28] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [15:52:39] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1225.eqiad.wmnet with reason: Maintenance [15:54:25] (03PS3) 10Scott French: php8.3: initial release of 8.3 image stack [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1165153 (https://phabricator.wikimedia.org/T398246) [15:56:47] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: eqsin purged 
consumers lag - https://phabricator.wikimedia.org/T399221#11014121 (10cmooney) >>! In T399221#11014023, @ssingh wrote: > I think leaving eqsin depooled given that it is off-peak there and observing this for a few hours i... [15:57:00] (03PS1) 10Jforrester: ZLangRegistry::fetchLanguageCodeFromZid: Check for invalid Title too [extensions/WikiLambda] (wmf/1.45.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1170376 (https://phabricator.wikimedia.org/T399755) [15:57:10] RESOLVED: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:58:12] 10SRE-SLO: Reduce the pyrra's multi-dc configurations where it makes sense - https://phabricator.wikimedia.org/T398534#11014129 (10elukey) We discovered this Pyrra bug https://github.com/pyrra-dev/pyrra/issues/667 that is affecting all the SLOs that are istio based. The Pyrra UI assumes that the metrics are in s... [16:00:04] jhathaway and moritzm: That opportune time for a Puppet request window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250717T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:01:59] !log marostegui@cumin1002 START - Cookbook sre.mysql.parsercache [16:02:08] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [16:02:22] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance [16:02:31] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance [16:04:35] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1170352|Prevent submissions of forms using hCaptcha until ready (T395619)]] (duration: 44m 46s) [16:04:39] T395619: Prevent form submission until hCaptcha has run - https://phabricator.wikimedia.org/T395619 [16:07:55] (03CR) 10Scott French: "Thanks, Effie!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1165153 (https://phabricator.wikimedia.org/T398246) (owner: 10Scott French) [16:08:24] (03CR) 10BCornwall: [C:03+1] hiera: service.yaml: use better aliasing for text/upload [puppet] - 10https://gerrit.wikimedia.org/r/1168192 (owner: 10Ssingh) [16:10:08] jouncebot: nowandnext [16:10:08] For the next 0 hour(s) and 49 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250717T1600) [16:10:08] In 0 hour(s) and 49 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250717T1700) [16:10:08] In 0 hour(s) and 49 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250717T1700) [16:10:14] wonderful [16:10:39] (03PS1) 10Máté Szabó: Load hCaptcha on first form interaction [extensions/ConfirmEdit] (wmf/1.45.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1170381 (https://phabricator.wikimedia.org/T399849) [16:11:59] (03CR) 10TrainBranchBot: [C:03+2] 
"Approved by mszabo@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.45.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1170381 (https://phabricator.wikimedia.org/T399849) (owner: 10Máté Szabó) [16:12:34] 06SRE-OnFire, 06cloud-services-team, 10Toolforge, 10Sustainability (Incident Followup): Add paging alert when many tools are unreachable - https://phabricator.wikimedia.org/T399870#11014189 (10fnegri) [16:12:40] 06SRE-OnFire, 10Cloud-VPS, 10cloud-services-team (FY2025/26-Q1), 10Sustainability (Incident Followup): Cloud Ceph misbehaving on Debian Bookworm - https://phabricator.wikimedia.org/T399858#11014190 (10fnegri) [16:14:35] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1012 - https://phabricator.wikimedia.org/T396970#11014191 (10Eevans) 05Open→03Resolved This is now complete. For posterity sake: We weren't able to salvage the data, the cluster was reimaged and the data on it rebuilt. [16:16:01] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on backup1007 - https://phabricator.wikimedia.org/T399847#11014196 (10Jclark-ctr) a:03Jclark-ctr [16:16:32] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on backup1007 - https://phabricator.wikimedia.org/T399847#11014198 (10Jclark-ctr) @jcrespo just fyi automated ticket was opened again for this host [16:18:59] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on backup1007 - https://phabricator.wikimedia.org/T399847#11014215 (10jcrespo) This time if fully Failed, so please change it. Do I stop the server first? [16:21:48] 10ops-codfw, 06SRE, 06DC-Ops: Arelion IC-374549 100G Transport outage (cr1-codfw -> cr1-eqiad) July 2025 - https://phabricator.wikimedia.org/T399097#11014224 (10cmooney) Crickets in the main from Arelion, one update earlier. ` 2025-07-17 14:08 Dear Customer, Please be advised that we are seeing some errors... 
[16:24:38] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host sretest2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:25:10] (03PS1) 10Aqu: Analyics: Refine restore monitor timerange [puppet] - 10https://gerrit.wikimedia.org/r/1170384 (https://phabricator.wikimedia.org/T369845) [16:25:21] (03Merged) 10jenkins-bot: Load hCaptcha on first form interaction [extensions/ConfirmEdit] (wmf/1.45.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1170381 (https://phabricator.wikimedia.org/T399849) (owner: 10Máté Szabó) [16:25:46] !log mszabo@deploy1003 Started scap sync-world: Backport for [[gerrit:1170381|Load hCaptcha on first form interaction (T399849)]] [16:25:47] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on backup1007.eqiad.wmnet with reason: failed disk [16:25:50] T399849: hCaptcha: Load hCaptcha JS after first form interaction - https://phabricator.wikimedia.org/T399849 [16:25:53] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on backup1007 - https://phabricator.wikimedia.org/T399847#11014239 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=ce8f4e27-d454-43c0-b1b5-892d46c710a6) set by jynus@cumin1003 for 1 day, 0:00:00 on 1 host(s) and their services with reason: faile... [16:26:12] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002" [16:26:13] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1053.eqiad.wmnet with OS bookworm [16:26:27] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#11014241 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host ganeti1053.eqiad.wmnet with OS bookworm completed... 
[16:26:38] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:27:27] (03CR) 10CI reject: [V:04-1] Analyics: Refine restore monitor timerange [puppet] - 10https://gerrit.wikimedia.org/r/1170384 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu) [16:27:43] (03CR) 10ZhaoFJx: zhwiki: Allow local securepoll setup (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100228 (https://phabricator.wikimedia.org/T380020) (owner: 10Stang) [16:28:24] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host sretest2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:28:44] (03CR) 10Aqu: [V:03+1] "PCC SUCCESS (NOOP 3 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1170384 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu) [16:28:48] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:29:34] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on backup1007 - https://phabricator.wikimedia.org/T399847#11014245 (10jcrespo) I've stopped it anyway, if you could start it up again after finishing, it would help me a lot, thank you. [16:30:28] !log mszabo@deploy1003 mszabo: Backport for [[gerrit:1170381|Load hCaptcha on first form interaction (T399849)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[16:31:14] (PS2) Aqu: Analyics: Refine restore monitor timerange [puppet] - https://gerrit.wikimedia.org/r/1170384 (https://phabricator.wikimedia.org/T369845) [16:31:50] ops-codfw, SRE, DC-Ops: Q4:rack/setup/install sretest2009 - https://phabricator.wikimedia.org/T396365#11014254 (Jhancock.wm) checked the physical cables and everything lines up right. couldn't get into the BMC. re-ran the reqular provisioning script and can access the BMC now. But won't let me set th... [16:33:00] !log mszabo@deploy1003 mszabo: Continuing with sync [16:40:13] !log mszabo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1170381|Load hCaptcha on first form interaction (T399849)]] (duration: 14m 26s) [16:40:17] T399849: hCaptcha: Load hCaptcha JS after first form interaction - https://phabricator.wikimedia.org/T399849 [16:43:18] SRE, Infrastructure-Foundations: Redundant bootloaders for software RAID - https://phabricator.wikimedia.org/T215183#11014324 (Eevans) [16:49:33] SRE, Infrastructure-Foundations: Redundant bootloaders for software RAID - https://phabricator.wikimedia.org/T215183#11014363 (Eevans) Has there been any progress toward goal #2? I didn't see where anything had been added to the mentioned runbook. For context: We replaced `sda` in aqs1012 recently (T39... [16:53:38] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1229.eqiad.wmnet with reason: Maintenance [16:53:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1229 (T399249)', diff saved to https://phabricator.wikimedia.org/P79350 and previous config saved to /var/cache/conftool/dbconfig/20250717-165345-marostegui.json [16:53:50] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [16:56:47] (CR) Scott French: [V:+2] "Built locally with docker-pkg." 
[docker-images/production-images] - https://gerrit.wikimedia.org/r/1165153 (https://phabricator.wikimedia.org/T398246) (owner: Scott French) [16:58:15] (CR) Kosta Harlan: [C:-2] "Waiting for approval." [mediawiki-config] - https://gerrit.wikimedia.org/r/1168178 (https://phabricator.wikimedia.org/T382148) (owner: Dreamy Jazz) [16:58:25] ops-eqiad, SRE, cloud-services-team, DC-Ops, Patch-For-Review: cloudcephosd10[48-51] service implementation - https://phabricator.wikimedia.org/T395910#11014409 (cmooney) Regarding the jumbo-frame complication on the plan to move to one link we are arranging to connect a second 25G on each of... [17:00:05] bd808: gettimeofday() says it's time for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250717T1700) [17:00:05] swfrench-wmf: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki infrastructure (UTC late) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250717T1700). 
[17:00:19] o/ [17:01:02] * bd808 looks to see if there is anything to push out today [17:01:10] FIRING: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:05:04] !log swfrench@deploy1003 Started scap sync-world: Migrate webserver-bookworm flavour back to (bookworm) mediawiki-httpd images - T378128 [17:05:10] T378128: Upgrade httpd images to bullseye or bookworm - https://phabricator.wikimedia.org/T378128 [17:06:10] RESOLVED: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:06:41] !log swfrench@deploy1003 swfrench: Migrate webserver-bookworm flavour back to (bookworm) mediawiki-httpd images - T378128 synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[17:08:54] !log swfrench@deploy1003 swfrench: Continuing with sync [17:09:10] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#11014464 (10VRiley-WMF) ganeti1054 has moved into A4 U38 [17:09:27] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#11014466 (10VRiley-WMF) [17:14:28] !log swfrench@deploy1003 Finished scap sync-world: Migrate webserver-bookworm flavour back to (bookworm) mediawiki-httpd images - T378128 (duration: 09m 56s) [17:14:33] T378128: Upgrade httpd images to bullseye or bookworm - https://phabricator.wikimedia.org/T378128 [17:15:58] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170390 [17:18:19] no further mediawiki deployments planned on my end for this infra window [17:29:10] FIRING: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:30:57] (03CR) 10Scott French: [V:03+2 C:03+2] php8.3: initial release of 8.3 image stack [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1165153 (https://phabricator.wikimedia.org/T398246) (owner: 10Scott French) [17:32:07] (03PS1) 10BryanDavis: developer-portal: Bump container to 2025-07-14-122305-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170391 [17:32:58] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [17:34:10] FIRING: [2x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:34:27] (03CR) 10BryanDavis: [C:03+2] developer-portal: Bump container to 2025-07-14-122305-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170391 (owner: 10BryanDavis) [17:36:08] (03Merged) 10jenkins-bot: developer-portal: Bump container to 2025-07-14-122305-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170391 (owner: 10BryanDavis) [17:36:18] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt ganeti1054 - vriley@cumin1002" [17:36:23] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt ganeti1054 - vriley@cumin1002" [17:36:23] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:36:36] !log bd808@deploy1003 helmfile [staging] START helmfile.d/services/developer-portal: apply [17:36:51] !log bd808@deploy1003 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [17:36:54] !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1054 [17:37:03] !log bd808@deploy1003 helmfile [codfw] START helmfile.d/services/developer-portal: apply [17:37:38] !log bd808@deploy1003 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [17:37:58] !log bd808@deploy1003 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [17:38:10] !log vriley@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1054 [17:38:18] !log bd808@deploy1003 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [17:39:10] RESOLVED: [3x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:40:14] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1054.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [17:41:44] (03PS2) 10Reedy: CommonSettings.php: Remove old $wgCentralDBname [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129230 (https://phabricator.wikimedia.org/T389348) [17:43:51] (03CR) 10Reedy: CommonSettings.php: Remove old $wgCentralDBname [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129230 (https://phabricator.wikimedia.org/T389348) (owner: 10Reedy) [17:44:20] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1054.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [17:57:16] !log jhancock@cumin1003 START - Cookbook sre.dns.netbox [17:57:30] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1054.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [17:59:54] !log jhancock@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:00:04] 06SRE, 06Infrastructure-Foundations: Redundant bootloaders for software RAID - https://phabricator.wikimedia.org/T215183#11014678 (10Eevans) As a follow-up, I did find a device with a missing bootloader: aqs1014, which went up after it's partman recipe was fixed (it has had SSDs replaced in the years since tho... [18:00:05] dancy and andre: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-7+Utc-0 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250717T1800). 
[18:00:15] o/ [18:00:26] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host pc2016 [18:00:37] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host pc2016 [18:02:21] vriley@cumin1002 provision (PID 1137132) is awaiting input [18:02:22] 06SRE, 06Infrastructure-Foundations: Redundant bootloaders for software RAID - https://phabricator.wikimedia.org/T215183#11014692 (10CDanis) >>! In T215183#11014363, @Eevans wrote: > Has there been any progress toward goal #2? I didn't see where anything had been added to the mentioned runbook. Good question... [18:03:00] (03PS1) 10TrainBranchBot: group2 to 1.45.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170396 (https://phabricator.wikimedia.org/T392180) [18:03:02] (03CR) 10TrainBranchBot: [C:03+2] group2 to 1.45.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170396 (https://phabricator.wikimedia.org/T392180) (owner: 10TrainBranchBot) [18:03:32] !log gmodena@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [18:03:53] !log gmodena@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [18:03:55] (03Merged) 10jenkins-bot: group2 to 1.45.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170396 (https://phabricator.wikimedia.org/T392180) (owner: 10TrainBranchBot) [18:05:30] !log jhancock@cumin1003 START - Cookbook sre.dns.netbox [18:07:10] FIRING: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:08:09] !log jhancock@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:09:19] !log vriley@cumin1002 END (FAIL) - Cookbook 
sre.hosts.provision (exit_code=99) for host ganeti1054.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [18:10:00] (03CR) 10Brouberol: [C:03+2] Analyics: Refine restore monitor timerange [puppet] - 10https://gerrit.wikimedia.org/r/1170384 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu) [18:10:03] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1054.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [18:12:01] !log dancy@deploy1003 rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.10 refs T392180 [18:12:06] T392180: 1.45.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T392180 [18:12:06] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1054.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [18:12:10] RESOLVED: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:12:32] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1054.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [18:15:56] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1054.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [18:19:39] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host pc2016.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [18:20:57] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1054.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [18:24:05] (03PS1) 10Eevans: date-gateway-staging: staging not deployed to codfw [puppet] - 10https://gerrit.wikimedia.org/r/1170400 (https://phabricator.wikimedia.org/T399856) [18:24:36] 
(CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1170400 (https://phabricator.wikimedia.org/T399856) (owner: Eevans) [18:25:18] (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1170400 (https://phabricator.wikimedia.org/T399856) (owner: Eevans) [18:26:05] (PS2) Eevans: date-gateway-staging: staging not deployed to codfw [puppet] - https://gerrit.wikimedia.org/r/1170400 (https://phabricator.wikimedia.org/T399856) [18:27:47] (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1170400 (https://phabricator.wikimedia.org/T399856) (owner: Eevans) [18:27:58] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1054.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [18:30:06] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host ganeti1054.eqiad.wmnet with OS bookworm [18:30:17] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#11014834 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host ganeti1054.eqiad.wmnet with OS bookworm [18:30:38] SRE, Infrastructure-Foundations: Redundant bootloaders for software RAID - https://phabricator.wikimedia.org/T215183#11014835 (Eevans) >>! In T215183#11014691, @CDanis wrote: >>>! In T215183#11014363, @Eevans wrote: > > [ ... ] > >> For context: We replaced `sda` in aqs1012 recently (T396970) and were (I... 
[18:32:03] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host pc2016.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [18:38:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [18:39:04] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs1022:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [18:43:38] !log vriley@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1054.eqiad.wmnet with reason: host reimage [18:44:56] (03CR) 10Dzahn: ":) had no idea, but thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1170275 (https://phabricator.wikimedia.org/T398854) (owner: 10Arnaudb) [18:45:43] 06SRE, 06Infrastructure-Foundations: Redundant bootloaders for software RAID - https://phabricator.wikimedia.org/T215183#11014882 (10Eevans) >>! In T215183#11014691, @CDanis wrote: >>>! In T215183#11014363, @Eevans wrote: > > [ ... ] > > I also never spent much time looking at or thinking about RAID10 hosts,... 
[18:48:26] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1054.eqiad.wmnet with reason: host reimage [18:54:12] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:54:25] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-esams:xe-0/1/2 (inter.link reserved port) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [18:54:59] (03PS2) 10Acamicamacaraca: Grant editpatrolprotected to sysops and bots [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170407 [18:55:17] (03PS3) 10Acamicamacaraca: Grant editpatrolprotected to sysops and bots [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170407 (https://phabricator.wikimedia.org/T399881) [19:03:29] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002" [19:03:53] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002" [19:03:54] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1054.eqiad.wmnet with OS bookworm [19:04:05] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#11014924 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host ganeti1054.eqiad.wmnet with OS bookworm completed... 
[19:04:23] (CR) Scott French: [C:+1] date-gateway-staging: staging not deployed to codfw [puppet] - https://gerrit.wikimedia.org/r/1170400 (https://phabricator.wikimedia.org/T399856) (owner: Eevans) [19:04:48] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, July 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - https://gerrit.wikimedia.org/r/1170407 (https://phabricator.wikimedia.org/T399881) (owner: Acamicamacaraca) [19:04:53] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#11014925 (VRiley-WMF) [19:05:10] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#11014926 (VRiley-WMF) Open→Resolved These have been imaged [19:07:23] (CR) Cwhite: [C:+2] date-gateway-staging: staging not deployed to codfw [puppet] - https://gerrit.wikimedia.org/r/1170400 (https://phabricator.wikimedia.org/T399856) (owner: Eevans) [19:09:25] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [19:24:41] SRE-swift-storage, MediaWiki-File-management: Undeleted file is an incorrect version - https://phabricator.wikimedia.org/T399892#11014984 (Bugreporter) [19:28:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [19:32:07] (PS1) Sbisson: CX3 Build 1.0.0+20250717 [extensions/ContentTranslation] (wmf/1.45.0-wmf.10) - 
10https://gerrit.wikimedia.org/r/1170412 (https://phabricator.wikimedia.org/T388503) [19:32:25] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, July 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/ContentTranslation] (wmf/1.45.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1170412 (https://phabricator.wikimedia.org/T388503) (owner: 10Sbisson) [19:35:47] FYI, I have a patch scheduled in the upcoming deployment window in about 25 minutes. I'll be a little late but eventually I'll be there and I'll handle my patch. [19:42:56] (03CR) 10Zoranzoki21: "@zivkovica006@gmail.com asked me to review this, but I'm unsure, so I'd like someone else with more knowledge to review this." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170407 (https://phabricator.wikimedia.org/T399881) (owner: 10Acamicamacaraca) [19:51:52] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for resquito - https://phabricator.wikimedia.org/T399899 (10REsquito-WMF) 03NEW [19:52:31] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for resquito - https://phabricator.wikimedia.org/T399899#11015090 (10REsquito-WMF) this ticket is a prerequisite for https://phabricator.wikimedia.org/T396672 and that @dr0ptp4kt is also readying a patch for additional access in ht... [19:58:50] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for resquito - https://phabricator.wikimedia.org/T399899#11015111 (10HShaikh) Approved. Thank you [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: That opportune time for a UTC late backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250717T2000). [20:00:05] Aca and stephanebisson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:23] *waves* [20:00:28] *waves* [20:01:33] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1170360 (https://phabricator.wikimedia.org/T397003) (owner: 10Tiziano Fogli) [20:02:56] TheresNoTime Are you around for deployment? :) [20:04:36] Aca are you deploying your patch? [20:05:06] Aca: I can in about 10 minutes [20:05:23] stephanebisson: can you deploy your own patch? [20:05:39] Yes, I'll go ahead if there's no objections [20:05:51] TheresNoTime: I'm discussing Aca's patch with Aca, it might require adding messages to WikimediaMessages. [20:06:00] Kizule: ack [20:06:09] yep, it will require that [20:06:10] stephanebisson: please proceed with your patch [20:06:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbisson@deploy1003 using scap backport" [extensions/ContentTranslation] (wmf/1.45.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1170412 (https://phabricator.wikimedia.org/T388503) (owner: 10Sbisson) [20:08:52] (03Merged) 10jenkins-bot: CX3 Build 1.0.0+20250717 [extensions/ContentTranslation] (wmf/1.45.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1170412 (https://phabricator.wikimedia.org/T388503) (owner: 10Sbisson) [20:09:08] !log sbisson@deploy1003 Started scap sync-world: Backport for [[gerrit:1170412|CX3 Build 1.0.0+20250717 (T388503 T395417 T395418)]] [20:09:22] T388503: Section Translation: Support expanding the existing section if it already exists - https://phabricator.wikimedia.org/T388503 [20:09:22] T395417: CX events EventGate validation errors: translation_source_title should be string - https://phabricator.wikimedia.org/T395417 [20:09:23] T395418: CX events EventGate validation errors: event_source should be string and equal to one of the enum values - https://phabricator.wikimedia.org/T395418 [20:11:08] !log sbisson@deploy1003 sbisson: Backport for [[gerrit:1170412|CX3 Build 1.0.0+20250717 (T388503 T395417 T395418)]] 
synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:13:11] (03PS2) 10Btullis: Disable all dumps timers on snapshot hosts [puppet] - 10https://gerrit.wikimedia.org/r/1170410 (https://phabricator.wikimedia.org/T398438) [20:14:38] !log sbisson@deploy1003 sbisson: Continuing with sync [20:15:22] (03CR) 10Dzahn: [V:03+1] "compiled on entire "C:scap" and it's noop - https://puppet-compiler.wmflabs.org/output/1137818/6298/" [puppet] - 10https://gerrit.wikimedia.org/r/1137818 (https://phabricator.wikimedia.org/T377889) (owner: 10Dzahn) [20:15:33] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 3 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1170410 (https://phabricator.wikimedia.org/T398438) (owner: 10Btullis) [20:16:00] (03CR) 10Dzahn: [V:03+1 C:03+2] scap: stop hardcoding scap user home to fix puppet breakage [puppet] - 10https://gerrit.wikimedia.org/r/1137818 (https://phabricator.wikimedia.org/T377889) (owner: 10Dzahn) [20:17:09] TheresNoTime: WikimediaMessages patch is created by Aca as well. It might need a backport to wmf.10 as well. https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaMessages/+/1170420 [20:17:16] Relevant config patch: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1170407 [20:17:16] (03CR) 10Dzahn: [V:03+1 C:03+2] "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1137818 (https://phabricator.wikimedia.org/T377889) (owner: 10Dzahn) [20:18:52] (03CR) 10Dzahn: "more sorry for the delay from my side - i'd still deploy this but it's low priority - maybe I should just reach out to you on IRC when is " [puppet] - 10https://gerrit.wikimedia.org/r/1117941 (https://phabricator.wikimedia.org/T274228) (owner: 10Dzahn) [20:19:37] Kizule: okay, I'll take a look. 
It will need backporting to wmf.10 yeah [20:20:18] !log sbisson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1170412|CX3 Build 1.0.0+20250717 (T388503 T395417 T395418)]] (duration: 11m 10s) [20:20:26] T388503: Section Translation: Support expanding the existing section if it already exists - https://phabricator.wikimedia.org/T388503 [20:20:26] T395417: CX events EventGate validation errors: translation_source_title should be string - https://phabricator.wikimedia.org/T395417 [20:20:27] T395418: CX events EventGate validation errors: event_source should be string and equal to one of the enum values - https://phabricator.wikimedia.org/T395418 [20:20:39] I'm done [20:20:48] ack :) [20:21:11] nicee [20:21:55] (03PS1) 10Bvibber: Database index hack to speed chartinfo API [extensions/JsonConfig] (wmf/1.45.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1170422 (https://phabricator.wikimedia.org/T393950) [20:23:18] if things are clear i'll push that real quick [20:23:28] bvibber: go ahead, I'm waiting on CI :) [20:23:48] tx <3 [20:23:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bvibber@deploy1003 using scap backport" [extensions/JsonConfig] (wmf/1.45.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1170422 (https://phabricator.wikimedia.org/T393950) (owner: 10Bvibber) [20:33:47] (03Merged) 10jenkins-bot: Database index hack to speed chartinfo API [extensions/JsonConfig] (wmf/1.45.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1170422 (https://phabricator.wikimedia.org/T393950) (owner: 10Bvibber) [20:34:39] !log bvibber@deploy1003 Started scap sync-world: Backport for [[gerrit:1170422|Database index hack to speed chartinfo API (T393950)]] [20:34:44] T393950: Metrics for when new charts are created and embedded - https://phabricator.wikimedia.org/T393950 [20:35:33] Kizule: can you take a look at & +1 https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaMessages/+/1170420 and I'll +2 it [20:36:30] TheresNoTime: Done [20:36:41] !log bvibber@deploy1003 
bvibber: Backport for [[gerrit:1170422|Database index hack to speed chartinfo API (T393950)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:37:55] !log bvibber@deploy1003 bvibber: Continuing with sync [20:37:58] confirmed good [20:38:25] (03PS1) 10Zoranzoki21: Add editpatrolprotected messages [extensions/WikimediaMessages] (wmf/1.45.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1170425 (https://phabricator.wikimedia.org/T399881) [20:38:45] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, July 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/WikimediaMessages] (wmf/1.45.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1170425 (https://phabricator.wikimedia.org/T399881) (owner: 10Zoranzoki21) [20:38:46] (03PS2) 10Samtar: Add editpatrolprotected messages [extensions/WikimediaMessages] (wmf/1.45.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1170425 (https://phabricator.wikimedia.org/T399881) (owner: 10Zoranzoki21) [20:39:18] TheresNoTime: I made a cherry-pick so CI can finish in time. [20:39:25] *on time [20:39:33] (oh, whoops, also did... hopefully that didn't mess anything up ^^') [20:40:10] TheresNoTime: As we did it on Gerrit, it's fine. :D [20:40:18] Just checked, nothing is different. 
[20:40:24] :) [20:40:42] nahhh, leave me the job of breaking wikis [20:40:45] :D [20:41:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [20:43:26] !log bvibber@deploy1003 Finished scap sync-world: Backport for [[gerrit:1170422|Database index hack to speed chartinfo API (T393950)]] (duration: 08m 47s) [20:43:31] T393950: Metrics for when new charts are created and embedded - https://phabricator.wikimedia.org/T393950 [20:43:34] done [20:43:40] :) [20:44:10] ok now for the fun part of the day -- taking the cat into her vet appointment :D [20:44:19] later all :D [20:44:28] good luck! [20:44:36] see ya [20:45:09] hell nah, this check is taking too long [20:46:00] Aca: still got time to deploy it this window? :) [20:46:08] yes [20:46:24] sorry for waiting [20:46:44] In the worst case scenario, I'm here for Aca. 
:D [20:47:09] I actually thought the messages could be added separately, and then Kizule told me they should be prepared for the deploy as well [20:47:46] so that's the context [20:50:10] FIRING: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [20:50:58] !log gmodena@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [20:51:08] !log gmodena@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [20:52:23] !log gmodena@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [20:52:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy1003 using scap backport" [extensions/WikimediaMessages] (wmf/1.45.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1170425 (https://phabricator.wikimedia.org/T399881) (owner: 10Zoranzoki21) [20:52:31] !log gmodena@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [20:53:28] (03Merged) 10jenkins-bot: Add editpatrolprotected messages [extensions/WikimediaMessages] (wmf/1.45.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1170425 (https://phabricator.wikimedia.org/T399881) (owner: 10Zoranzoki21) [20:53:43] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users group (LDAP and kerberos), for aprum - https://phabricator.wikimedia.org/T398650#11015246 (10aranyap) 05Resolved→03Open Hi @cmooney ! I'm having some trouble trying to access JupyterHub and after some poking around with @dr0ptp4...
[20:53:46] !log samtar@deploy1003 Started scap sync-world: Backport for [[gerrit:1170425|Add editpatrolprotected messages (T399881)]] [20:53:50] T399881: Serbo-Croatian sysops can't protect page to patrollers only - https://phabricator.wikimedia.org/T399881 [20:55:10] RESOLVED: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [20:55:14] !log gmodena@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [20:56:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [20:58:16] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users group (LDAP and kerberos), for aprum - https://phabricator.wikimedia.org/T398650#11015267 (10cmooney) Hi @aranyap yeah you are not in that group. ` cmooney@ldap-maint1001:~$ ldapsearch -x cn=wmf | grep aprum cmooney@ldap-maint1001:... [21:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250717T2100) [21:01:40] FIRING: [2x] BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [21:05:38] Kizule: Aca: the deploy is still ongoing for the messages patch, its just being a bit slow having to rebuild some container images [21:05:53] We started wondering what's going on. 
[21:05:54] ack [21:05:57] It's okay, we can wait. Ack. [21:06:13] (03PS1) 10Ryan Kemper: redfish: fix inconsequential typo [software/spicerack] - 10https://gerrit.wikimedia.org/r/1170428 [21:06:31] (03CR) 10Ryan Kemper: [C:03+1] redfish: fix inconsequential typo [software/spicerack] - 10https://gerrit.wikimedia.org/r/1170428 (owner: 10Ryan Kemper) [21:06:40] FIRING: [3x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [21:06:55] RESOLVED: [2x] BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [21:13:06] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users group (LDAP and kerberos), for aprum - https://phabricator.wikimedia.org/T398650#11015297 (10ssingh) [Claiming this as the clinic duty person this week] @aranyap: https://ldap.toolforge.org/user/aranyap indicates you are not part o... [21:14:00] (03CR) 10Ssingh: "Hi. Sorry about this. Let's deploy this on Monday; please ping us whenever you are around and/or feel free to send a calendar invite." 
[puppet] - 10https://gerrit.wikimedia.org/r/1117941 (https://phabricator.wikimedia.org/T274228) (owner: 10Dzahn) [21:16:17] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for resquito - https://phabricator.wikimedia.org/T399899#11015321 (10ssingh) a:03ssingh [21:16:55] FIRING: [3x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [21:18:11] !log samtar@deploy1003 zoranzoki21, samtar: Backport for [[gerrit:1170425|Add editpatrolprotected messages (T399881)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:18:15] T399881: Serbo-Croatian sysops can't protect page to patrollers only - https://phabricator.wikimedia.org/T399881 [21:18:17] Finally! [21:18:22] oh god [21:18:47] will just continue, that doesn't need testing does it? [21:18:57] Nope [21:19:01] checkin [21:19:07] LGTM [21:19:11] !log samtar@deploy1003 zoranzoki21, samtar: Continuing with sync [21:19:28] MediaWiki pages exist now [21:19:29] (03CR) 10Ryan Kemper: [C:03+2] redfish: fix inconsequential typo [software/spicerack] - 10https://gerrit.wikimedia.org/r/1170428 (owner: 10Ryan Kemper) [21:21:55] RESOLVED: [3x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [21:24:53] (03PS2) 10Ryan Kemper: Replace elasticsearch api with python requests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1167299 (https://phabricator.wikimedia.org/T390860) [21:25:55] (03CR) 10Ryan Kemper: "Pushing a patch to a" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1167299 (https://phabricator.wikimedia.org/T390860) (owner: 10Ryan Kemper) [21:28:32] (03Merged) 10jenkins-bot: redfish: fix inconsequential typo [software/spicerack] - 
10https://gerrit.wikimedia.org/r/1170428 (owner: 10Ryan Kemper) [21:31:34] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users group (LDAP and kerberos), for aprum - https://phabricator.wikimedia.org/T398650#11015340 (10aranyap) @cmooney @ssingh I just requested access through the online system. Thank you! [21:31:35] !log samtar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1170425|Add editpatrolprotected messages (T399881)]] (duration: 37m 49s) [21:31:39] T399881: Serbo-Croatian sysops can't protect page to patrollers only - https://phabricator.wikimedia.org/T399881 [21:31:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170407 (https://phabricator.wikimedia.org/T399881) (owner: 10Acamicamacaraca) [21:32:17] Kizule: Aca: now deploying the config change [21:32:29] Nice :) [21:33:01] ack [21:33:26] (03CR) 10Dzahn: "thank you, Sukhbir, sounds good:)" [puppet] - 10https://gerrit.wikimedia.org/r/1117941 (https://phabricator.wikimedia.org/T274228) (owner: 10Dzahn) [21:33:50] (03CR) 10CI reject: [V:04-1] Replace elasticsearch api with python requests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1167299 (https://phabricator.wikimedia.org/T390860) (owner: 10Ryan Kemper) [21:33:59] PROBLEM - SSH on bast7002 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:35:08] (03Merged) 10jenkins-bot: Grant editpatrolprotected to sysops and bots [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170407 (https://phabricator.wikimedia.org/T399881) (owner: 10Acamicamacaraca) [21:35:21] !log samtar@deploy1003 Started scap sync-world: Backport for [[gerrit:1170407|Grant editpatrolprotected to sysops and bots (T399881)]] [21:35:59] RECOVERY - SSH on bast7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:39:26] !log 
samtar@deploy1003 aleksandar, samtar: Backport for [[gerrit:1170407|Grant editpatrolprotected to sysops and bots (T399881)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:39:31] T399881: Serbo-Croatian sysops can't protect page to patrollers only - https://phabricator.wikimedia.org/T399881 [21:40:07] Kizule: Aca: ready to test [21:40:10] checkin [21:40:44] protection level is now displayed in the menu, LGTM [21:40:46] no-op on srwiki, so it's fine. [21:41:06] !log samtar@deploy1003 aleksandar, samtar: Continuing with sync [21:41:59] PROBLEM - SSH on bast7002 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:42:59] RECOVERY - SSH on bast7002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:47:58] !log samtar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1170407|Grant editpatrolprotected to sysops and bots (T399881)]] (duration: 12m 37s) [21:48:03] T399881: Serbo-Croatian sysops can't protect page to patrollers only - https://phabricator.wikimedia.org/T399881 [21:48:09] done finally! [21:48:22] Thank you for the deployyy [21:48:30] Thanks, all good! 
[21:48:32] no worries :) [21:56:55] (03PS1) 10Dzahn: gerrit: replace host names in replica config with variables [puppet] - 10https://gerrit.wikimedia.org/r/1170433 (https://phabricator.wikimedia.org/T387833) [21:57:10] FIRING: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [21:59:33] (03PS2) 10Dzahn: gerrit: replace host names in replica config with variables [puppet] - 10https://gerrit.wikimedia.org/r/1170433 (https://phabricator.wikimedia.org/T387833) [22:01:23] (03PS3) 10Dzahn: gerrit: replace host names in replica config with variables [puppet] - 10https://gerrit.wikimedia.org/r/1170433 (https://phabricator.wikimedia.org/T387833) [22:01:25] !log cmooney@cumin1003 START - Cookbook sre.dns.admin DNS admin: pool site eqsin [reason: repool eqsin to test backhaul cct packet loss, T399221] [22:01:29] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site eqsin [reason: repool eqsin to test backhaul cct packet loss, T399221] [22:01:29] T399221: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221 [22:02:10] RESOLVED: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [22:03:15] (03CR) 10Dzahn: [C:03+1] "noop per https://puppet-compiler.wmflabs.org/output/1170433/6302/" [puppet] - 10https://gerrit.wikimedia.org/r/1170433 (https://phabricator.wikimedia.org/T387833) (owner: 10Dzahn) [22:03:23] (03CR) 10Dzahn: [V:03+1] gerrit: replace host names in replica config with variables 
[puppet] - 10https://gerrit.wikimedia.org/r/1170433 (https://phabricator.wikimedia.org/T387833) (owner: 10Dzahn) [22:08:00] jouncebot: nowandnext [22:08:00] No deployments scheduled for the next 7 hour(s) and 51 minute(s) [22:08:00] In 7 hour(s) and 51 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250718T0600) [22:08:10] (03CR) 10Zabe: [C:03+2] PendingChangesPager: Stop using ANSI-89 joins [extensions/FlaggedRevs] (wmf/1.45.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1170318 (https://phabricator.wikimedia.org/T399641) (owner: 10Jforrester) [22:16:53] (03Merged) 10jenkins-bot: PendingChangesPager: Stop using ANSI-89 joins [extensions/FlaggedRevs] (wmf/1.45.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1170318 (https://phabricator.wikimedia.org/T399641) (owner: 10Jforrester) [22:17:20] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1170318|PendingChangesPager: Stop using ANSI-89 joins (T399641)]] [22:17:25] T399641: Wikimedia\Rdbms\DBQueryError: Error 1054: Unknown column 'cl_target_id' in 'ON'Function: MediaWiki\Pager\IndexPager::buildQueryInfo (PendingChangesPager)Query: SELECT page_namespace,page_title,page_len,rev_len,page_latest,fp - https://phabricator.wikimedia.org/T399641 [22:19:19] !log zabe@deploy1003 jforrester, zabe: Backport for [[gerrit:1170318|PendingChangesPager: Stop using ANSI-89 joins (T399641)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[22:20:11] !log zabe@deploy1003 jforrester, zabe: Continuing with sync [22:25:28] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1170318|PendingChangesPager: Stop using ANSI-89 joins (T399641)]] (duration: 08m 08s) [22:25:32] T399641: Wikimedia\Rdbms\DBQueryError: Error 1054: Unknown column 'cl_target_id' in 'ON'Function: MediaWiki\Pager\IndexPager::buildQueryInfo (PendingChangesPager)Query: SELECT page_namespace,page_title,page_len,rev_len,page_latest,fp - https://phabricator.wikimedia.org/T399641 [22:28:55] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:29:13] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:30:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.58.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [22:30:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [22:30:55] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:31:13] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:35:10] FIRING: [4x] BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [22:35:39] RESOLVED: [4x] CoreBGPDown: 
Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [22:39:04] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs1022:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [22:40:10] RESOLVED: [4x] BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [22:50:40] FIRING: [7x] BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [22:54:10] PROBLEM - MariaDB Replica Lag: s2 #page on db1229 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 21562.08 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:54:12] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [22:54:35] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-esams:xe-0/1/2 (inter.link reserved port) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [22:55:40] RESOLVED: [2x] BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [22:56:11] o/ [22:56:16] can someone depool db1229? [22:56:16] !incidents [22:56:16] 6479 (UNACKED) db1229 (paged)/MariaDB Replica Lag: s2 (paged) [22:56:24] it's not pooled AFAICT [22:56:24] thanks swfrench-wmf <3 [22:56:27] oh ok [22:56:38] trying to figure out what's up [22:56:45] !ack 6479 [22:56:46] 6479 (ACKED) db1229 (paged)/MariaDB Replica Lag: s2 (paged) [22:57:31] !log [cumin1002:~] $ sudo dbctl instance db1229 depool [22:57:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:57:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [22:57:46] mutante: it wasn't pooled [22:58:05] is it one that was just reimaged, similar to es* the other day? [22:58:11] eh, I mean.. kernel reboots! [22:58:26] mutante: doesn't look like it: uptime 45d [22:59:07] it was depooled at 16:53 today [22:59:21] https://sal.toolforge.org/production?p=0&q=db1229&d= [22:59:32] ah, downtime expired [22:59:36] (6h) [22:59:37] https://phabricator.wikimedia.org/P79350 [23:00:18] swfrench-wmf: yep, I see that in the SAL as well [23:00:21] any objections if I re-created the downtime and flag in -data-persistence? [23:00:30] SGTM [23:00:40] FIRING: [2x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [23:00:50] I have the downtime cookbook open.. same thing [23:00:54] swfrench-wmf: ok, please do [23:01:12] a comment on https://phabricator.wikimedia.org/T399249 should do [23:01:39] ah, thanks for finding the task! 
[23:03:23] !log swfrench@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1229.eqiad.wmnet with reason: Maintenance - T399249 [23:03:27] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [23:03:47] thanks, left comment on ticket and IRC [23:04:33] cya all later again:) [23:04:43] Resolving the page. [23:04:53] awesome [23:05:03] oh, I thought it was already done based on cortobot, thanks [23:05:06] {{done}} thanks for the quick response y'all! [23:05:11] cwhite: thank you, I always forget to do that and am unpleasantly surprised the next day :) [23:05:12] same to you [23:05:40] FIRING: [3x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [23:05:59] glad to help! [23:06:33] in this context.. also: https://phabricator.wikimedia.org/T396816 [23:06:35] * mutante waves [23:06:53] :) [23:09:25] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [23:10:40] RESOLVED: [3x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [23:17:40] FIRING: [3x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [23:21:55] RESOLVED: [3x] BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [23:38:10] FIRING: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [23:38:26] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1170443 [23:38:26] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1170443 (owner: 10TrainBranchBot) [23:41:16] (03PS1) 10Dzahn: zuul::main: install apparmor-utils, needed for docker [puppet] - 10https://gerrit.wikimedia.org/r/1170444 (https://phabricator.wikimedia.org/T395938) [23:42:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169737 (owner: 10Krinkle) [23:42:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [23:43:01] (03Merged) 10jenkins-bot: multiversion: Fix "Class Wikimedia\MWConfig\Exception not found" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169737 (owner: 10Krinkle) [23:43:14] !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1169737|multiversion: Fix "Class Wikimedia\MWConfig\Exception not found"]] [23:45:10] !log krinkle@deploy1003 krinkle: Backport for [[gerrit:1169737|multiversion: Fix "Class 
Wikimedia\MWConfig\Exception not found"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [23:50:02] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1170443 (owner: 10TrainBranchBot) [23:55:40] RESOLVED: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [23:59:48] !log krinkle@deploy1003 krinkle: Continuing with sync